We develop appropriately generalized notions of indexability for
problems of dynamic resource allocation where the resource concerned
may be assigned more flexibly than is allowed, for example,
in classical multi-armed bandits. Most especially we have in mind 
the allocation of a divisible resource (manpower, money, equipment) 
to a collection of objects (projects) requiring it in cases where its 
over-concentration would usually be far from optimal. The resulting 
project indices are functions of both a resource level and a state. 
They have a simple interpretation as a fair charge for increasing 
the resource available to the project from the specified resource level 
when in the specified state. We illustrate ideas by reference to two 
model classes which are of independent interest. In the first, a pool of 
servers is assigned dynamically to a collection of service teams, each 
of which mans a service station. We demonstrate indexability under 
a natural assumption that the service rate delivered is increasing and 
concave in the team size. The second model class is a generalization 
of the spinning plates model for the optimal deployment of a divis- 
ible investment resource to a collection of reward generating assets. 
Asset indexability is established under appropriately drawn laws of 
diminishing returns for resource deployment. For both model classes 
numerical studies provide evidence that the proposed greedy index 
heuristic performs strongly. 

1. Introduction. A notable, now classical, contribution to the theory of 
dynamic resource allocation was the elucidation by Gittins [8, 9] of index- 
based solutions to a large family of multi-armed bandit problems (MABs). 



Received August 2009; revised March 2010. 
Supported by EPSRC Grant EP/E049265/01. 
2 Supported by an RCUK Fellowship. 

AMS 2000 subject classifications. Primary 68M20; secondary 90B22, 90B36. 

Key words and phrases. Asset management, dynamic programming, dynamic resource 
allocation, full indexability, index policy, Lagrangian relaxation, monotone policy, queue- 
ing control. 

This is an electronic reprint of the original article published by the 
Institute of Mathematical Statistics in The Annals of Applied Probability, 
2011, Vol. 21, No. 3, 876-907. This reprint differs from the original in pagination 
and typographic detail. 






K. D. GLAZEBROOK, D. J. HODGE AND C. KIRKBRIDE 



This is a class of models concerned with the sequential allocation of effort, 
to be thought of as a single indivisible resource, to a collection of stochastic 
reward generating projects (or bandits as they are sometimes called). Gittins
demonstrated that optimal project choices are those of highest index. There 
is no doubt that the idea that strongly performing policies are determined 
by simple, interpretable calibrations (i.e., indices) of decision options is an 
attractive and powerful one and offers crucial computational benefits. There 
is now substantial literature describing extensions to and reformulations of 
Gittins' result. Some key contributions are cited in the recent survey of 
Mahajan and Teneketzis [14]. 

Whittle [21] introduced a class of restless bandit problems (RBPs) as a 
means of addressing a critical limitation of Gittins' MABs, namely, that 
projects should remain frozen while not in receipt of effort. In RBPs, projects 
may change state while active or passive though according to different dy- 
namics. However, this generalization is bought at great cost. In contrast 
to MABs, RBPs are almost certainly intractable having been shown to be 
PSPACE-hard by Papadimitriou and Tsitsiklis [16]. Whittle [21] proposed 
an index heuristic for those RBPs which pass an indexability test. This 
heuristic reduces to Gittins' index policy in the MAB case. Whittle's in- 
dex emerges from a Lagrangian relaxation of the original problem and has 
an interpretation as a fair charge for the allocation of effort to a particular 
project in a particular state. Weber and Weiss [20] established a form of 
asymptotic optimality for Whittle's heuristic under given conditions. More 
recently, several studies have demonstrated the power of Whittle's approach 
in a range of application areas. These include the dynamic routing of cus- 
tomers for service [2, 10], machine maintenance [13], asset management [11] 
and inventory routing [1]. 

The above classical models and associated theory are undeniably pow- 
erful when applicable. However, the scope of their applicability is heavily 
constrained by the very simple view the models take of the resource to be 
allocated. As indicated above, in Gittins' MAB model a single indivisible 
resource is allocated wholly and exclusively to a single project at each de- 
cision epoch. In Whittle's RBP formulation, parallel server versions of this 
are allowed. Many applications, however, call for the allocation of a divisible 
resource (e.g., money, manpower or equipment) in situations where its over 
concentration would usually be far from optimal. This is the case, for exam- 
ple, in the problem concerning the planning of new product pharmaceutical 
research which was discussed by Gittins [9] and which provided practical 
motivation for his pioneering contribution. This paper records the first out- 
comes of a major research program whose goal is to develop a usable and 
effective index theory for such problems. 

In Section 2 we present a general model for dynamic resource allocation. 
Both Gittins' MABs and Whittle's RBPs may be recovered as special cases 
as may the recent model of [12] which extends Gittins' MABs such that 
bandit activation consumes amounts of the available resource which may
vary by bandit and state. Our general model allows for resource to be applied 
at a range of levels to each constituent project, subject to some overall 
constraint on the total rate at which resource is available. A notion of (full) 
indexability which generalizes that of Whittle for RBPs is developed. Any 
project which is fully indexable has an index which is a function both of a 
given resource level (a) and of a given state (x). The index W(a,x) may be 
understood as a fair charge for raising the project's resource level above a 
when in state x. We discuss how to use such indices to develop heuristics 
for dynamic resource allocation when all projects are fully indexable. 

In Sections 3 and 4 we use the ideas and methods of Section 2 to con- 
struct index heuristics for the dynamic allocation of a divisible resource in 
the context of two model classes which are of considerable interest in their 
own right. In Section 3 we deploy the framework of Section 2 to develop 
heuristics for the dynamic allocation of a pool of S servers to K service sta- 
tions (or customer classes) at which queues may form. This model is able to 
capture situations where, for example, each of K customer classes is served 
by a dedicated team of specialists. Additionally, S higher level generalist 
servers are available for deployment across the customer classes to supplement
the specialist teams as demand dictates. Deployment of $a_k$ generalists
to customer class k enhances the local specialist team which then delivers
service collectively at rate $\mu_k(a_k)$. An assumption that the service rate
functions $\mu_k$ are increasing and concave reflects a law of diminishing returns as
service teams grow. The problem of determining how the pool of generalists 
should be deployed across the customer classes in response to queue length 
information is formulated as a dynamic resource allocation problem of the 
kind discussed in Section 2. The analysis which establishes full indexability 
in Section 3 markedly adds to the queueing control literature in establishing 
monotonicity with respect to service costs of optimal policies for a derived 
problem involving a single queue. An algorithm is given for the computation 
of indices. A numerical study featuring nearly 10,000 two-station problems
provides evidence that a greedy index heuristic for allocating the common
service pool is close to optimal throughout.

The model class studied in Section 4 generalizes the so-called spinning 
plates model discussed by Glazebrook, Kirkbride and Ruiz-Hernandez [11]. 
It is a flexible finite state model class in which a divisible investment resource 
is available to drive improvements to the (reward) performance of K reward 
generating assets, which in the absence of any such resource deployment will 
tend to deteriorate. Positive investment both arrests an asset's tendency to 
deteriorate and enhances asset performance by enabling movement of the 
asset state toward those in which its reward generating performance will be 
stronger. Full indexability for assets is established under laws of diminishing 
returns as asset investment levels grow. This considerably extends the work 
of Glazebrook, Kirkbride and Ruiz-Hernandez [11]. A numerical study which
features 14,000 two asset problems testifies to the strong performance of the 
greedy index heuristic in comparison to optimum and to competitor policies. 
Conclusions and proposals for further work are discussed in Section 5. 

2. A model for dynamic resource allocation. We propose a semi-Markov
decision process (SMDP) formulation $\{(\Omega_k, c_k, r_k, q_k), 1 \le k \le K\}$ of the
problem of dynamically allocating a resource to a collection of K stochastic
projects. This formulation includes Gittins' MABs and Whittle's RBPs as
special cases. In our SMDP project k is characterized by its (finite or countable)
state space $\Omega_k$, its highest activation level $L_k \in \mathbb{Z}^+$, cost rate function
$c_k : \{0, 1, \dots, L_k\} \times \Omega_k \to \mathbb{R}^+$, resource consumption function
$r_k : \{0, 1, \dots, L_k\} \times \Omega_k \to \mathbb{R}^+$ and Markov transition law $q_k$. The model is in
continuous time. We use $x_k, x'_k \in \Omega_k$ for generic states of project k and
$\mathbf{x}, \mathbf{x}' \in \times_{k=1}^{K} \Omega_k$ for generic states of the process. In the SMDP an action
$\mathbf{a} = (a_1, a_2, \dots, a_K)$ must be taken at time 0 and after each (state) transition of the process.
This specifies the resource level $a_k \in \{0, 1, \dots, L_k\}$ to be applied to project
k, $1 \le k \le K$. The choice $a_k = 0$ indicates that resource at a minimal level
(usually none) is to be applied to k (k is passive), while the choice $a_k = L_k$
indicates a maximal resource allocation. Resource level $a_k$ applied to project
k when in state $x_k$ leads to a consumption of resource at rate $r_k(a_k, x_k)$, with
$r_k(\cdot, x_k)$ increasing $\forall k, x_k$. In the major examples discussed in the upcoming
sections we will have $r_k(a_k, x_k) = a_k$ $\forall k, x_k$ and the resource level is
identified with the resource consumed. When resource level $a_k$ is applied to
project k when in state $x_k$, it incurs costs at rate $c_k(a_k, x_k)$. Both cost and
resource consumption rates are additive over projects. It will be convenient
to write $c(\mathbf{a}, \mathbf{x}) = \sum_k c_k(a_k, x_k)$ and $r(\mathbf{a}, \mathbf{x}) = \sum_k r_k(a_k, x_k)$. The set of
admissible actions in process state $\mathbf{x}$ is given by $\mathcal{A}(\mathbf{x}) = \{\mathbf{a}; r(\mathbf{a}, \mathbf{x}) \le R\}$ where
R is the rate at which resource is available to the system, assumed constant
over time. We suppose that $\mathcal{A}(\mathbf{x}) \ne \emptyset$, $\mathbf{x} \in \times_{k=1}^{K} \Omega_k$. An admissible policy is
a rule for taking admissible actions.
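
For small problems the admissible set can be enumerated directly. The following Python sketch is illustrative only: the function name `admissible_actions` and the representation of each consumption function as a callable are assumptions made here, not part of the model specification.

```python
from itertools import product


def admissible_actions(levels, consumption, state, R):
    """Enumerate the actions a with sum_k r_k(a_k, x_k) <= R.

    levels[k]      -- highest activation level L_k of project k
    consumption[k] -- callable r_k(a_k, x_k) (hypothetical representation)
    state[k]       -- current state x_k of project k
    R              -- rate at which resource is available
    """
    ranges = [range(L + 1) for L in levels]
    return [
        a for a in product(*ranges)
        if sum(r(ak, xk) for r, ak, xk in zip(consumption, a, state)) <= R
    ]


# Two projects, resource level identified with resource consumed.
unit = lambda a, x: a
actions = admissible_actions([2, 2], [unit, unit], state=(0, 0), R=2)
```

The enumeration grows as the product of the $L_k + 1$, which is one way to see why direct optimization over actions becomes impractical as K increases.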

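Should action $\mathbf{a}$ be taken when the system is in state $\mathbf{x}$, the system will
remain in state $\mathbf{x}$ for an amount of time which is exponentially distributed
with rate

$\sum_{k=1}^{K} \sum_{x'_k \ne x_k} q_k(x'_k \mid x_k, a_k).$

The transition following will be from state $x_k$ to state $x'_k$ within project k
with probability

$q_k(x'_k \mid x_k, a_k) \Big/ \sum_{j=1}^{K} \sum_{x'_j \ne x_j} q_j(x'_j \mid x_j, a_j).$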





GENERAL NOTIONS OF INDEXABILITY 






Hence the projects evolve independently, given the choice of action, with $q_k$
yielding transition rates for project k. The goal of analysis is the determination
of a policy for resource allocation (a rule for taking admissible actions at
all decision epochs) which minimizes the average cost per unit time incurred 
over an infinite horizon. 
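
The holding time and jump mechanism can be sketched in a few lines of Python. This is a sketch under the usual competing-exponentials reading of the dynamics, in which each project contributes its own transition rates and the first transition to fire determines the next state. The function name and the dictionary representation of the rates are assumptions made for illustration.

```python
def holding_rate_and_jump_probs(rates):
    """Total exit rate and jump probabilities for one state-action pair.

    rates[k] maps each possible next state x'_k of project k to the
    transition rate of project k towards x'_k under the current state
    and action (hypothetical representation; rates assumed positive).
    Returns (total_rate, probs) where probs[(k, x'_k)] is the probability
    that the next transition moves project k to x'_k.
    """
    total = sum(q for project in rates for q in project.values())
    probs = {
        (k, nxt): q / total
        for k, project in enumerate(rates)
        for nxt, q in project.items()
    }
    return total, probs


# Project 0 can only move to state 1; project 1 can move to state 3 or 1.
total, probs = holding_rate_and_jump_probs([{1: 0.5}, {3: 1.0, 1: 0.5}])
```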

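To develop ideas and notation we use U for the set of deterministic, stationary,
Markov (DSM) and admissible policies determined by functions
$\mathbf{u}$ with domain $\times_{k=1}^{K} \Omega_k$ which satisfy $\mathbf{u}(\mathbf{x}) \in \mathcal{A}(\mathbf{x})$ $\forall \mathbf{x}$. Fix $\mathbf{u} \in U$. We
shall also use $\{\mathbf{X}(t), t \ge 0\}$ for the system state evolving over time and
$[\mathbf{u}\{\mathbf{X}(t)\}, t \ge 0]$ for the corresponding stochastic process of admissible actions
taken by $\mathbf{u}$. We write

(1) $C(\mathbf{u}, \mathbf{x}) = \liminf_{t \to \infty} \frac{1}{t} \left( \int_0^t E_{\mathbf{u}}^{\mathbf{x}}\, c(\mathbf{u}\{\mathbf{X}(s)\}, \mathbf{X}(s))\, ds \right)$

for the average cost per unit time incurred under policy $\mathbf{u}$ over an infinite
horizon from initial state $\mathbf{x}$. In (1) $E_{\mathbf{u}}^{\mathbf{x}}$ denotes an expectation taken over
realizations of the system evolving under $\mathbf{u}$ from initial state $\mathbf{x}$. We shall
assume the existence of a policy $\mathbf{u} \in U$ such that $C(\mathbf{u}, \mathbf{x}) < \infty$ $\forall \mathbf{x}$ and write
$C^{\mathrm{opt}}(\mathbf{x})$ for the minimized cost rate, namely,

(2) $C^{\mathrm{opt}}(\mathbf{x}) = \inf_{\mathbf{u} \in U} C(\mathbf{u}, \mathbf{x}).$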

We shall use the term optimal to denote a policy (assumed to exist) which 
achieves the infimum in (2) uniformly over initial states. This applies both 
to the problem in (2) and also to the derived optimization problems we shall 
discuss later in the account. In the model classes featured in Sections 3 and 4 
it will be the case that the average costs in (1) and (2) are independent of x. 
Henceforth, for simplicity, we shall suppress dependence on the initial state 
x in the notation. 
We shall use 

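(3) $R(\mathbf{u}) = \liminf_{t \to \infty} \frac{1}{t} \left( \int_0^t E_{\mathbf{u}}\, r(\mathbf{u}\{\mathbf{X}(s)\}, \mathbf{X}(s))\, ds \right)$

for the average rate at which resource is consumed under policy $\mathbf{u}$. We also
write

(4) $C(\mathbf{u}) = \sum_{k=1}^{K} C_k(\mathbf{u}), \qquad R(\mathbf{u}) = \sum_{k=1}^{K} R_k(\mathbf{u})$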

to give a disaggregation of the cost and resource consumption rates into the 
contributions from individual projects. 

In principle, the tools of dynamic programming (DP) are available to de- 
termine optimal policies. See, for example, [17]. However, direct application 
of DP is computationally infeasible other than for small problems (crucially,
small K). Hence, our primary interest lies in the development of heuristic
policies which are close to cost minimizing. To this end we relax the optimization
problem in (2) by extending the class of policies from the DSM
admissible class U to those DSM policies $\mathbf{u} : \times_{k=1}^{K} \Omega_k \to \times_{k=1}^{K} \{0, 1, \dots, L_k\}$
which consume resource at an average rate which is no greater than R.
Hence, we write

(5) $\bar{C}^{\mathrm{opt}} = \inf_{\mathbf{u}} \sum_{k=1}^{K} C_k(\mathbf{u}),$

where in (5), the infimum is taken over the collection of DSM policies satisfying

(6) $\sum_{k=1}^{K} R_k(\mathbf{u}) \le R.$

We now relax the problem again by further extending the class of policies 
and by incorporating the constraint (6) into the objective (5) in a Lagrangian 
fashion. We write 

(7) $C(W) = \inf_{\mathbf{u}} \sum_{k=1}^{K} \{C_k(\mathbf{u}) + W R_k(\mathbf{u})\} - WR.$

In (7) the infimum is taken over the class of DSM policies
$\mathbf{u} : \times_{k=1}^{K} \Omega_k \to \times_{k=1}^{K} \{0, 1, \dots, L_k\}$ which allow, for each project k, a free choice of action
from the set $\{0, 1, \dots, L_k\}$ at each decision epoch. It is clear that

$C(W) \le \bar{C}^{\mathrm{opt}} \le C^{\mathrm{opt}}, \qquad W \in \mathbb{R}^+.$
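
The first of these inequalities can be checked in one line. As a sketch in the notation of (5)-(7), take any DSM policy $\mathbf{u}$ satisfying (6) and any $W \ge 0$; the only fact used is that the bracketed term in the second line is nonpositive.

```latex
\begin{align*}
C(W) &\le \sum_{k=1}^{K} \{C_k(\mathbf{u}) + W R_k(\mathbf{u})\} - WR \\
     &= \sum_{k=1}^{K} C_k(\mathbf{u}) + W \Bigl\{ \sum_{k=1}^{K} R_k(\mathbf{u}) - R \Bigr\} \\
     &\le \sum_{k=1}^{K} C_k(\mathbf{u}).
\end{align*}
```

Taking the infimum over policies satisfying (6) bounds $C(W)$ by the optimum in (5). The second inequality holds because every admissible policy in U consumes resource at a rate no greater than R at all times and therefore satisfies (6).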

However, the Lagrangian relaxation of our optimization problem expressed 
by (7) admits, on account both of the policy class involved and the nature 
of the objective, an additive project-based decomposition. Expressed differ- 
ently, an optimal policy for (7) operates optimal policies for the individual 
projects in parallel. In an obvious notation we write 

(8) $C(W) = \sum_{k=1}^{K} C_k(W) - WR,$

where

(9) $C_k(W) = \inf_{u_k} \{C_k(u_k) + W R_k(u_k)\}, \qquad 1 \le k \le K.$

The optimization problem in (9) concerns project k alone. We denote it 
P(k,W). In its objective the Lagrange multiplier W plays the role of a 
charge per unit of time and per unit of resource consumed. An optimal policy
$u_k(W)$ for $P(k, W)$ minimizes an aggregate rate of project costs incurred
and charges levied for resource consumed. Further, the policy $\mathbf{u}(W)$ which
applies $u_k(W)$ to each project k, achieves $C(W)$ in (7) and hence provides
a solution to the above Lagrangian relaxation. Note that in what follows
we shall use the notation $\mathbf{u}(W, \mathbf{x}), u_k(W, x_k)$ to denote the action (resource
consumption levels) chosen by DSM policies $\mathbf{u}(W), u_k(W)$ in states $\mathbf{x}, x_k$,
respectively.

In order to develop natural project calibrations (or indices) which can 
facilitate the construction of effective heuristics for our original problem (2), 
we seek optimal policies for the problems $\{P(k, W), W \in \mathbb{R}^+, 1 \le k \le K\}$
which are structured as in Definition 1 below. We first require additional
notation. Write

(10) $\Pi_k\{u_k(W), a\} = \{x \in \Omega_k; u_k(W, x) \le a\}, \qquad a \in \{0, 1, \dots, L_k - 1\},$

for the set of project k states for which policy $u_k(W)$ chooses to consume
resource at level a or below.

Definition 1 (Full indexability). Project k is fully indexable if there
exists a family of DSM policies $\{u_k(W), W \in \mathbb{R}^+\}$ such that $u_k(W)$ is optimal
for $P(k, W)$ $\forall W$ and $\Pi_k\{u_k(W), a\}$ is nondecreasing in W for each
$a \in \{0, 1, \dots, L_k - 1\}$.

To summarize the requirements of Definition 1, a project k will be fully 
indexable if the problem P(k, W) has an optimal policy which, for any given 
state, consumes an amount of resource which is decreasing in the resource 
charge W. Full indexability enables a calibration of the individual projects
as described in Definition 2. 

Definition 2 (Project indices). If project k is fully indexable as in
Definition 1, a corresponding index function $W_k : \{0, 1, \dots, L_k - 1\} \times \Omega_k \to \mathbb{R}^+$
is given by

(11) $W_k(a, x) = \inf[W; x \in \Pi_k\{u_k(W), a\}].$

Remark. The index $W_k(a, x)$ can be thought of as a fair charge at
project k for raising the resource level from a to a + 1 in state x. Were
a resource charge less than $W_k(a, x)$ to be levied, the consumption of the
additional resource would be preferable, while if the resource charge were to
be in excess of the index, that would not be the case. We shall adopt the
convention that the index function is extended to $W_k : \{-1, 0, 1, \dots, L_k\} \times \Omega_k \to \mathbb{R}^+ \cup \{\infty\}$
where $W_k(-1, x) = \infty$, $W_k(L_k, x) = 0$ $\forall x \in \Omega_k$.

The following is a simple consequence of the above definitions. Its proof 
is omitted. 
Lemma 1. If project k is fully indexable, the index $W_k(a, x)$ is decreasing
in a, for fixed x. 

Hence, under full indexability, the fair charge for raising the resource level 
for project k in any state x from a to a + 1 is decreasing in the resource 
level a. 
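
When a family of optimal policies is available only as a black box, the index in (11) can be approximated numerically. The following Python sketch uses bisection on the charge W; it assumes that the policy's resource level is nonincreasing in W, as full indexability requires. The callable `policy`, the bound `w_max` and the toy example are all hypothetical.

```python
def project_index(policy, a, x, w_max, tol=1e-9):
    """Approximate inf{W >= 0 : policy(W, x) <= a} by bisection.

    policy(W, x) -- resource level chosen in state x under charge W,
                    assumed nonincreasing in W (hypothetical callable)
    w_max        -- a charge large enough that policy(w_max, x) <= a
    """
    if policy(0.0, x) <= a:
        return 0.0
    lo, hi = 0.0, w_max  # invariant: policy(lo, x) > a >= policy(hi, x)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if policy(mid, x) <= a:
            hi = mid
        else:
            lo = mid
    return hi


# Toy policy: one extra unit of resource for every fair charge above W.
charges = [4.0, 2.5, 1.0]
toy_policy = lambda W, x: sum(1 for w in charges if W < w)
indices = [project_index(toy_policy, a, x=None, w_max=10.0) for a in range(3)]
```

In this toy case the recovered indices are close to 4.0, 2.5 and 1.0, which decrease in the resource level a, as Lemma 1 requires.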

We now return to consideration of the Lagrangian relaxation in (7) and (8) 
and suppose that all K projects are fully indexable with families of optimal 
policies 

$\{u_k(W), W \in \mathbb{R}^+, 1 \le k \le K\}$

structured as in Definition 1. Under full indexability, all of these policies
have a structure describable in terms of the index functions $W_k$, $1 \le k \le K$.
Theorem 2 now follows.

Theorem 2. Suppose that all K projects are fully indexable with extended
index functions $W_k : \{-1, 0, 1, \dots, L_k\} \times \Omega_k \to \mathbb{R}^+ \cup \{\infty\}$. The policy
$\mathbf{u}(W)$ such that

$\mathbf{u}(W, \mathbf{x}) = \mathbf{a} \iff W_k(a_k - 1, x_k) > W \ge W_k(a_k, x_k), \qquad 1 \le k \le K,\ \mathbf{x} \in \times_{k=1}^{K} \Omega_k,$

achieves $C(W)$ $\forall W \in \mathbb{R}^+$.
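
The construction in Theorem 2 amounts to counting, for each project, the unit increments whose fair charge exceeds W. The Python sketch below assumes the indices for the current state are supplied as lists that decrease in the resource level; the function name and the numerical values are hypothetical and chosen only for illustration.

```python
def relaxed_action(indices, W):
    """Action of the Lagrangian-optimal policy in a given system state.

    indices[k][a] -- index of project k at resource level a in its current
                     state, for a = 0, ..., L_k - 1, decreasing in a
    W             -- prevailing resource charge
    Project k receives a_k = number of levels whose index exceeds W, so that
    the index at a_k - 1 exceeds W while the index at a_k does not.
    """
    return tuple(sum(1 for w in row if w > W) for row in indices)


# Hypothetical index values for two projects with five resource levels each.
example_indices = [
    [9.0, 7.0, 3.0, 2.0, 1.0],
    [10.0, 8.0, 6.0, 5.0, 1.5],
]
action = relaxed_action(example_indices, W=4.0)
```

With these illustrative numbers the relaxed policy allocates six units in total, which shows how the action chosen for the relaxation can exceed a resource constraint of, say, R = 5.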

Remark. According to Theorem 2, policy u(W) constructs actions (al- 
locations of resource) in each system state by accumulating resource at each 
project until the fair charge for adding further resource drops below the pre- 
vailing charge W. This is strongly suggestive of how effective, interpretable 
heuristics for our original dynamic resource allocation problem based on 
the above indices (fair charges) may be constructed when all projects are 
fully indexable. A natural greedy index heuristic constructs actions in every 
system state by increasing resource consumption levels in decreasing order 
of the above station indices until the point is reached when the resource 
constraint is violated by additional allocation of resource. 

Formally the greedy index heuristic is structured as follows: 

Greedy index heuristic. In state x the greedy index heuristic constructs 
an action (allocation of resource) as follows: 

Step 1. The initial allocation is $\mathbf{0} = \{0, 0, \dots, 0\}$. The current allocation is
$\mathbf{a} = \{a_1, a_2, \dots, a_K\}$ with $\sum_k r_k(a_k, x_k) \le R$.

Step 2. Choose any k satisfying

$W_k(a_k, x_k) = \max_{1 \le j \le K} W_j(a_j, x_j).$

[Figure 1: the index values $W_1(a, 5)$ and $W_2(a, 2)$, $0 \le a \le 4$, marked on a common axis together with the Lagrange multiplier W; the five largest index values are marked *.]

Fig. 1. Index values for state x = (5,2).

Step 3. If $\mathbf{e}_k$ denotes a K-vector whose kth component is 1 with zeroes
elsewhere, the new deployment is $\mathbf{a} + \mathbf{e}_k$ if

(12) $\sum_{l \ne k} r_l(a_l, x_l) + r_k(a_k + 1, x_k) \le R.$

If there is strict inequality in (12), return to Step 1 (with $\mathbf{a} + \mathbf{e}_k$ as the
current allocation) and repeat. Otherwise, stop and declare $\mathbf{a} + \mathbf{e}_k$ to be the
chosen action in $\mathbf{x}$. If

$\sum_{l \ne k} r_l(a_l, x_l) + r_k(a_k + 1, x_k) > R,$

stop and declare $\mathbf{a}$ to be the chosen action in $\mathbf{x}$.
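
Steps 1 to 3 can be sketched in Python as follows. This is a minimal illustration rather than a reference implementation: the callables `index` and `consumption`, the restriction of Step 2 to projects below their maximum level, and the numerical index values are all assumptions made here.

```python
def greedy_index_action(index, consumption, levels, state, R):
    """Greedy index heuristic for one system state.

    index(k, a, x_k)       -- fair charge for raising project k from a to a + 1
    consumption(k, a, x_k) -- resource consumed by project k at level a
    levels[k]              -- highest activation level L_k
    """
    K = len(levels)
    a = [0] * K  # Step 1: start from the zero allocation
    while True:
        # Step 2: largest current index among projects that can still grow
        candidates = [k for k in range(K) if a[k] < levels[k]]
        if not candidates:
            return tuple(a)
        k = max(candidates, key=lambda j: index(j, a[j], state[j]))
        # Step 3: resource consumed if project k is raised by one level
        used = sum(consumption(j, a[j] + (j == k), state[j]) for j in range(K))
        if used > R:
            return tuple(a)  # the increment is inadmissible
        a[k] += 1
        if used == R:
            return tuple(a)  # the resource is exhausted


# Hypothetical index values for two projects in state x = (5, 2).
tables = [[9.0, 7.0, 3.0, 2.0, 1.0], [10.0, 8.0, 6.0, 5.0, 1.5]]
chosen = greedy_index_action(
    index=lambda k, a, x: tables[k][a],
    consumption=lambda k, a, x: a,
    levels=[5, 5],
    state=(5, 2),
    R=5,
)
```

With these illustrative values the heuristic follows the five largest indices (10, 9, 8, 7, 6) and stops when the five units of resource are used up.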

Remark. We shall use Figure 1 to illustrate the construction of actions 
by both the policy $\mathbf{u}(W)$ (as in Theorem 2) and the greedy index heuristic
in a simple problem with K = 2 in which both projects are fully indexable.
Section 3 discusses a class of models in which $r_k(a_k, x_k) = a_k$ $\forall k, x_k$ and
where all projects have state space $\mathbb{N}$ and a common maximum resource
level, L say, which is equal to R, the total rate at which resource is available.
Suppose now that L = R = 5 in such a model and that the system state
is $\mathbf{x} = (x_1, x_2) = (5, 2)$. Figure 1 indicates values of the appropriate project
indices $W_1(a, 5)$ and $W_2(a, 2)$ for the range $0 \le a \le 4$ together with the value
of the Lagrange multiplier W.

The policy $\mathbf{u}(W)$ will make allocations of resource supported by those
index values which are above W. Hence from Figure 1, the choice of action
in state x = (5,2) will be a = (2,4). This is an inadmissible action for the
original problem since the total resource rate allocated (6) exceeds that available
(5). The greedy heuristic makes allocations of resource supported by
the five largest index values (indicated by * in Figure 1). Plainly, the action
taken by the index heuristic is a = (2,3). As the system state evolves under 
the operation of either policy, the index values change as do the implied 
actions. 

The major challenge to implementation of the above program for heuristic 
construction is the identification of optimal policies for the problems 

$\{P(k, W), 1 \le k \le K, W \in \mathbb{R}^+\}$

which meet the requirements of Definition 1. In Sections 3 and 4 we are able 
to achieve this in the context of two model classes for which we are able to 
establish an appropriate form of full indexability. For the Section 3 problem, 
we also give an algorithm for index computation. For both model classes we 
proceed to assess the performance of the greedy index heuristic in extensive 
numerical studies. 

Remark. We recover Whittle's RBPs [21] by making the choices $r_k(a_k, x_k) = a_k$,
$L_k = 1$, $1 \le k \le K$ and $R < K$ in the above. Hence there are just
two modes of activation (active, passive) of each project, with R projects to 
be made active at each epoch. For this special case the above greedy index 
heuristic is precisely the index heuristic proposed by Whittle. If we make 
the further choice R = 1 and impose the requirement that projects can only 
change state under the active action, we then recover Gittins' MAB [8] and 
its associated (optimal) index policy. 

3. The optimal allocation of a pool of servers. We illustrate the above 
ideas by considering a set-up in which service is provided at K service sta- 
tions. These stations could represent distinct geographical locations or fa- 
cilities dedicated to the service of a particular class of customer. Customers 
arrive at the stations in K independent Poisson streams, with $\lambda_k$ the rate
for station k. A pool of S servers is available to support service at the K
stations. Should a servers from the pool be allocated to station k at any
point, the resulting exponential service rate is $\mu_k(a)$. Note that there may
be a local team of servers permanently stationed at k (i.e., in addition to
any allocated from the pool) in which case we will have $\mu_k(0) > 0$. Please
note also that we shall suppose that all servers (whether permanently based 
at a location or allocated there from the common pool) offer service as a 
team, namely, that they act in concert as a single server. The goal of analysis 
is the determination of a policy for deploying the common service pool in 
response to queue length information to minimize some linear measure of 
holding cost rate for the system incurred over an infinite horizon. 

More formally, the system state at time t is $\mathbf{n}(t) = \{n_1(t), n_2(t), \dots, n_K(t)\}$
where $n_k(t)$ is the number of customers at service station k (including any
in service) at t. We shall on occasion refer to $n_k(t)$ as the head count at
station k at time t. This system state is observed continuously. The decision
epochs for the system are time zero and the times at which the system state
changes. At each decision epoch, some action $\mathbf{a} = (a_1, a_2, \dots, a_K)$ is taken,
where $a_k \in \mathbb{N}$, $1 \le k \le K$, and $\sum_k a_k \le S$. Action $\mathbf{a}$ denotes the deployment
of $a_k$ servers from the central pool to service station k, $1 \le k \le K$. Should
action $\mathbf{a}$ be taken in state $\mathbf{n}$ then an exponentially distributed amount of
time with rate

(13) $\Lambda = \sum_k \{\lambda_k + \mu_k(a_k) I(n_k > 0)\}$

will elapse before a change of state. In (13) I is an indicator function. The
next state of the system will be $\mathbf{n} + \mathbf{e}_k$ with probability $\lambda_k / \Lambda$ and will be
$\mathbf{n} - \mathbf{e}_k$ with probability $\mu_k(a_k) I(n_k > 0) / \Lambda$, $1 \le k \le K$.

A DSM admissible policy is given by a map $\mathbf{u} : \mathbb{N}^K \to \Xi$, where

(14) $\Xi = \{\mathbf{a}; a_k \in \mathbb{N}, 1 \le k \le K, \text{ and } \sum_k a_k \le S\}$

and is a rule for choosing admissible actions as a function of the current
system state. The cost associated with policy $\mathbf{u}$ is given by

(15) $C(\mathbf{u}) = \sum_k h_k N_k(\mathbf{u}),$

where the $h_k$ are positive weights (holding cost rates) and $N_k(\mathbf{u})$ is the time
average number of customers at station k under policy $\mathbf{u}$. The optimization
problem of interest is given by

(16) $C^{\mathrm{opt}} = \inf_{\mathbf{u} \in U} C(\mathbf{u}),$

where in (16) the infimum is over the set U of DSM admissible policies.

We pause to note that this problem does indeed belong to the class of 
dynamic resource allocation problems described in the preceding section. 
We make the choices $c_k(a_k, n_k) = h_k n_k$, $r_k(a_k, n_k) = a_k$, $L_k = S$, $1 \le k \le K$,
with the transition rates $q_k(n'_k \mid n_k, a_k)$ satisfying

$q_k(n_k + 1 \mid n_k, a_k) = \lambda_k,$

$q_k(n_k - 1 \mid n_k, a_k) = \mu_k(a_k) I(n_k > 0),$

for all choices of k, $n_k$ and $a_k$. They are otherwise zero. One thing which is
special about this problem is that it is possible to utilize all of the resource 
which is on offer all of the time. It is plainly optimal to do so. Hence, in (14), 
we can restrict admissible actions to those which deploy all servers from the 
pool. 

Before proceeding to develop appropriate notions of full indexability/indices,
we describe assumptions we shall make about our service rate functions
$\mu_k(\cdot)$. In Assumption 1 we use the notation

$\lceil x \rceil = \min\{y; y \in \mathbb{Z}^+ \text{ and } y > x\}, \qquad x \in \mathbb{R}^+.$

Assumption 1. There exist functions $\tilde{\mu}_k : \mathbb{R}^+ \to \mathbb{R}^+$ which are strictly
increasing, twice differentiable and strictly concave, satisfying

(17) $\tilde{\mu}_k(a) = \mu_k(a), \qquad a \in [0, S] \cap \mathbb{N},$

and

(18) $\sum_{k=1}^{K} \lceil \tilde{\mu}_k^{-1}(\lambda_k) \rceil < S.$

From (17) the functions $\tilde{\mu}_k$, $1 \le k \le K$, are smooth extrapolations of the
service rates on the integers in the range [0, S]. The properties of these
functions reflect the fact that, while an increase in the size of the team at 
a station results in a higher service rate, the marginal benefit of adding 
an additional member diminishes as the team size grows. Requirement (18) 
guarantees the existence of stable policies under which all queue lengths 
remain finite. 

Remark. It is the assumption of strict concavity of the service rate 
functions at each station which stimulates an active approach to the distri- 
bution of the pool of servers around the stations and which makes this an 
interesting problem. Had we assumed, for example, that the service rates 
were all convex in the team size, then [18] shows that in an optimal pol- 
icy the service pool would always be allocated en bloc and we are driven 
back to the "single server" world of the simple bandit models. This result 
is intuitively obvious, as observed by Richard Serfozo to Sobel: "the fastest 
rate is also the cheapest." Indeed, the resulting service control problem has 
a well-known solution in the form of the so-called $c\mu$-rule. (See [3].)

We are able to develop a Lagrangian relaxation of the problem in (15) 
and (16) as in the preceding section. As in the analysis of Section 2 up 
to (8), such a relaxation yields K optimization problems P(k,W), one for 
each station, which here take the form 

(19) $C_k(W) = \inf_{u_k} \{h_k N_k(u_k) + W S_k(u_k)\},$

where in (19), the infimum is over the class of DSM policies $u_k : \mathbb{N} \to \{0, 1, \dots, S\}$
which can deploy any number of servers (up to S) at station k at each
epoch, $N_k(u_k)$ is the time average head count and $S_k(u_k)$ the time average
number of servers deployed at k under policy $u_k$. The optimization problem
in (19) concerns station k alone and seeks to choose, at each station k
decision epoch and in response to queue length information for station k,
the number of servers (from the S available) to be deployed there. The goal
is to make such choices to minimize costs which are an aggregate of those
incurred through customers waiting $[h_k N_k(u_k)]$ and charges imposed for the
provision of service $[W S_k(u_k)]$. Note that Lagrange multiplier W here has an
economic interpretation as the charge imposed per server per unit of time.

We now wish to develop index heuristics for our service allocation problem 
by developing station indices of the form described in the preceding section. 
These flow from the property of full indexability defined with respect to 
solutions to the problems $P(k, W)$, $1 \le k \le K$, and described in Definition 1.
However, full indexability is a property of individual stations and hence we 
now focus on a single station and drop the station identifier k until further 
notice. For clarity, the single station problem P(W) is formulated as an 
SMDP as follows: 

1. The state of the system at time $t \in \mathbb{R}^+$ is n(t), the number of customers
(head count) at the station. New customers arrive at the station according
to a Poisson process of rate $\lambda$.

2. Decision epochs occur at time 0 and whenever there is a change of state.
At each such epoch an action from the set $A = \{0, 1, \dots, S\}$ is chosen.
Should action $a \in A$ be chosen at time t at which point n(t) = n > 0 then
costs will be incurred from t at rate $hn + Wa$ and the first event following
t will occur at time t + X where $X \sim \exp[\lambda + \mu(a)]$. With probabilities
$\lambda[\lambda + \mu(a)]^{-1}$ and $\mu(a)[\lambda + \mu(a)]^{-1}$ the event will be, respectively, an
arrival or a service completion.

3. The goal of analysis will be the determination of a stationary policy to 
minimize the average cost rate incurred over an infinite horizon. Triv- 
ially, optimal policies offer no service (a = 0) when the system is empty 
[n(t)=0]. 
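
For numerical exploration, the single station problem formulated in items 1 to 3 can be solved approximately by relative value iteration after uniformization. The Python sketch below is one possible approach and is not the index algorithm of this paper. It truncates the head count at a hypothetical level `n_max` (arrivals at `n_max` are blocked), and the function name, tolerances and example service rates are all assumptions.

```python
def solve_station(lam, mu, h, W, n_max=20, tol=1e-8, max_iter=20000):
    """Relative value iteration for the truncated single station problem.

    lam   -- Poisson arrival rate
    mu[a] -- service rate when a pool servers are deployed, a = 0, ..., S
    h, W  -- holding cost rate and charge per server per unit time
    Returns the list policy[n] of servers chosen at head count n.
    """
    S = len(mu) - 1
    unif = lam + max(mu)  # uniformization rate
    V = [0.0] * (n_max + 1)
    policy = [0] * (n_max + 1)
    for _ in range(max_iter):
        V_new = [0.0] * (n_max + 1)
        # Empty system: no service is offered, only an arrival changes the state.
        V_new[0] = (lam * V[1] + (unif - lam) * V[0]) / unif
        for n in range(1, n_max + 1):
            up = V[min(n + 1, n_max)]
            best_a, best_q = 0, float("inf")
            for a in range(S + 1):
                q = (h * n + W * a + lam * up + mu[a] * V[n - 1]
                     + (unif - lam - mu[a]) * V[n]) / unif
                if q < best_q:
                    best_a, best_q = a, q
            policy[n] = best_a
            V_new[n] = best_q
        diffs = [new - old for new, old in zip(V_new, V)]
        V = [v - V_new[0] for v in V_new]  # keep relative values bounded
        if max(diffs) - min(diffs) < tol:
            break
    return policy


# Hypothetical increasing, concave service rates with a local team (mu[0] > 0).
rates = [1.0, 1.6, 2.0, 2.2]
free_service = solve_station(0.5, rates, h=1.0, W=0.0)
costly_service = solve_station(0.5, rates, h=1.0, W=1000.0)
```

With no charge the computed policy deploys all S servers whenever the station is nonempty, and with a very large charge it deploys none. Intermediate charges can be explored in the same way, with the caveat that the truncation may distort decisions near `n_max`.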

The quest for full indexability is greatly simplified in this case by the 
existence of optimal policies for P(W) for which the choice of number of 
servers is increasing in the current head count. We call such policies mono- 
tone. This conclusion follows from Theorem 4 in Stidham and Weber [19], 
which applies to a queueing system with state space N and Poisson arrivals 
with an objective which combines a holding cost which is both increasing 
in the state and unbounded, with action costs which are nonnegative and 
increasing in the resource level. All of these requirements hold in P(W). 
Stidham and Weber's analysis first considers the problem of choosing a pol- 
icy to minimize the expected cost incurred in moving the system from a 
general initial state to the empty state (their Theorem 2) and then deploys 
arguments from renewal theory to demonstrate that such a policy will also 
minimize long run average costs (their Section 1.3). We state our conclusion 
as Proposition 3. 

Proposition 3 (Stidham and Weber). There exists a monotone policy 
which is optimal for P(W) . 

The problem of establishing monotonicity with respect to queue size of 
optimal policies for service control problems for queues with Poisson input 
is not new. In addition to Stidham and Weber [19], see [4-7, 15]. While such 
monotonicity is helpful in establishing full indexability and in the subsequent 
computation of index functions, it is not the key to proving the latter. This 
is rather the demonstration (to which we now proceed in Section 3.1) that 
optimal policies for $P(W)$ are monotone in $W$. Proving this significantly 
extends the literature on service control problems for M/M/1 queues. 

3.1. Stations are fully indexable. In light of Proposition 3 we can recast 
and simplify the requirements of full indexability expressed in Definition 1. 
Let $u(W)$ be an optimal policy for $P(W)$ which is monotone. It follows that 
for all choices of $W \in \mathbb{R}^+$ and $0 \le a \le S - 1$, 

$$U\{u(W), a\} = \{n \in \mathbb{N};\ u(W, n) \le a\} = \{0, 1, \ldots, N(a, W)\}$$

for some $N(a, W) \in \mathbb{N} \cup \{\infty\}$. We now have the following: 

Definition 3 (Full indexability). The station will be fully indexable if 
there exists a family of DSM policies $\{u(W), W \in \mathbb{R}^+\}$ for which (i) $u(W)$ 
is monotone and optimal for $P(W)$ $\forall W \in \mathbb{R}^+$ and (ii) the corresponding 
$N(a, W)$ is increasing in $W$, $\forall a \in \{0, 1, \ldots, S - 1\}$. 

To summarize the requirements of Definition 3, a station will be fully 
indexable if the service charge problem P(W) has a monotone optimal policy 
for which the number of servers deployed is decreasing in the service charge 
W for any given head count. Full indexability enables a calibration of the 
individual stations as described in Definition 4. 

Definition 4 (Station indices). If the station is fully indexable, the 
corresponding index function $W: \{0, 1, \ldots, S - 1\} \times \mathbb{N} \to \mathbb{R}^+$ is given by 

$$W(a, n) = \inf\{W;\ n \le N(a, W)\}. \tag{20}$$

In light of Proposition 3 above, Lemma 1 may be extended as follows in 
this case: 

Lemma 4. If the station is fully indexable, the index W(a,n) is (i) de- 
creasing in a for fixed n and (ii) increasing in n for fixed a. 

Please note that optimal policies for $P(W)$ will be unchanged if all cost 
rates (both holding costs and service charges) are divided by $W > 0$ throughout. 
When we do that, we see that increasing $W$ is equivalent to decreasing 
the holding cost rate $h$ in problems for which the service charge rate is fixed. 
This being so, we develop the following convenient reformulation of the 
definition of full indexability above: refer to the problem obtained by setting 
$W = 1$ in the above [namely $P(1)$] as $Q(h)$ to emphasize dependence on the 
holding cost parameter $h$. Hence, $Q(h)$ is the problem given by 

$$C(h) = \inf_{u} \{hN(u) + S(u)\}.$$

From Proposition 3 we are able to assert the existence of optimal policies for 
$Q(h)$ which are monotone. The following is trivially equivalent to Definition 3 
above. 

Definition 5 (Full indexability: alternative definition). The station 
will be fully indexable if there exists a family of DSM policies $\{u(h), h \in \mathbb{R}^+\}$ 
such that (i) $u(h)$ is optimal for $Q(h)$ $\forall h \in \mathbb{R}^+$; (ii) each $u(h)$ is monotone 
with 

$$U\{u(h), a\} = \{0, 1, \ldots, M(a, h)\},$$

where $M(a, h)$ is decreasing in $h$ $\forall a \in \{0, 1, \ldots, S - 1\}$. 

To summarize, to achieve full indexability, instead of requiring (according 
to Definition 3) that the optimal service level decreases with the service 
charge W (for a fixed value of the holding cost rate h) , we now equivalently 
require it to increase with the holding cost rate h (for fixed service charge 
W = 1). This reformulation of full indexability which focuses attention on 
the holding cost element of the objective yields a more accessible account. 

We begin this part of our analysis by noting that it is easy to establish that 
any optimal policy $u(h)$ for $Q(h)$ must be such that $\mu\{u(h,n)\} > 0$, $n \ge 1$. 
It follows that the head count process is ergodic under its operation. We 
uniformize station evolution by rescaling time such that 

$$\lambda + \mu(S) = 1.$$

Under this uniformization, the DP optimality equations for the problem 
$Q(h)$ are as follows: 

$$\lambda v(h, n) = hn + \lambda v(h, n + 1) + \min_{a}\{a - \mu(a)[v(h,n) - v(h,n-1)]\} - \gamma(h), \quad n \ge 1, \tag{21}$$

where the minimum in (21) is over the range $0 \le a \le S$. Note that in (21) 
the quantity $\gamma(h)$ is the minimized cost rate for $Q(h)$ with $v(h, \cdot)$ the 
corresponding bias function, where $v(h,0) = 0$. If we write $C(h,n,t)$ for the 
minimum total cost incurred in $Q(h)$ during $[0,t)$ when $n(0) = n$, then we 
have $C(h, n, t) \sim t\gamma(h) + v(h, n)$. 

Action $a$ is optimal for $Q(h)$ in state $n$ if and only if it achieves the 
minimum in (21). To proceed further, we write $\Delta v(h,n) = v(h,n) - v(h, n-1)$, 
$n \ge 1$, and $\Delta v(h,0) = 0$. Hence (21) now becomes 

$$-\lambda \Delta v(h, n+1) = hn + \min_{a}\{a - \mu(a)\Delta v(h,n)\} - \gamma(h), \quad n \ge 0. \tag{22}$$

We note in passing that it is trivial to deduce from the inductive 
specification of $\Delta v(h, \cdot)$ given by the optimality equations, that the quantities 
$\{\Delta v(h, n), n \ge 1\}$ are well defined, including in the event that there are 
several optimal policies for $Q(h)$. The following is an immediate consequence 
of (22). 

Lemma 5. A DSM policy $u$ is optimal for $Q(h)$ if and only if 

$$\Delta v(h, n)[\mu(u(n) + 1) - \mu(u(n))] \le 1 \le \Delta v(h,n)[\mu(u(n)) - \mu(u(n) - 1)], \quad n \ge 1, \tag{23}$$

where $\mu(S + 1) = \mu(S)$ in (23). 
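
The two-sided test in (23) is easy to apply once a bias difference $\Delta v(h,n)$ is available. The following is a hedged sketch only: the function names and the service rates in the usage note are illustrative assumptions, the service rate function is passed as a list `mu` with `mu[a]` $= \mu(a)$, and actions are restricted to $a \ge 1$ because optimal policies use a positive service rate in nonempty states.

```python
def is_locally_optimal(dv_n, mu, a):
    """Check the two inequalities of (23) for action a (1 <= a <= S) in a
    state whose bias difference is dv_n. mu is a list with mu[a] = mu(a)."""
    S = len(mu) - 1
    if not 1 <= a <= S:
        raise ValueError("action must satisfy 1 <= a <= S")
    mu_up = mu[a + 1] if a < S else mu[S]   # convention mu(S + 1) = mu(S)
    left = dv_n * (mu_up - mu[a])           # gain from one more server
    right = dv_n * (mu[a] - mu[a - 1])      # loss from one fewer server
    return left <= 1.0 <= right

def optimal_actions(dv_n, mu):
    """All actions a >= 1 that satisfy (23) for the given bias difference."""
    S = len(mu) - 1
    return [a for a in range(1, S + 1) if is_locally_optimal(dv_n, mu, a)]

def argmin_action(dv_n, mu):
    """Direct minimizer of a - mu(a) * dv_n over 0 <= a <= S, as in (22)."""
    S = len(mu) - 1
    return min(range(S + 1), key=lambda a: a - mu[a] * dv_n)
```

With the illustrative concave rates `mu = [0.0, 0.4, 0.6, 0.7]`, a bias difference of 6 selects two servers by both routes, while a bias difference of 20 selects all three.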

Please note that if a policy $u$ is such that the inequalities in (23) are all 
strict then it is uniquely optimal and so must be monotone by Proposition 3. 
Should the left-hand inequality be satisfied as an equation for some $n$ with 
$u(n) < S$, then both $u(n)$ and $u(n) + 1$ are optimal choices of action in 
state $n$. To develop the analysis further we need information regarding the 
quantities $\Delta v(h,n)$ when viewed as functions of $h$. 

Lemma 6. The function $\Delta v(\cdot,n)$ is continuous $\forall n \ge 1$. 

Proof. It is trivial to establish that the average cost rate $\gamma(h)$ is 
continuous in $h$. Observe from (22) that 

$$\Delta v(h,1) = \lambda^{-1}\gamma(h)$$

and hence $\Delta v(\cdot, 1)$ is continuous. From (22) we also note that it is 
straightforward to establish that, if $\Delta v(\cdot,n)$ is continuous, then so must be $\Delta v(\cdot, n+1)$, 
$n \ge 1$. The result follows by an induction argument. □ 
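
The inductive specification of $\Delta v(h,\cdot)$ through (22) can be sketched as a forward recursion. This is an illustration under assumptions, not the authors' implementation: the function name is hypothetical, the minimized cost rate `gamma` is taken as given, and `mu` is a list with `mu[a]` $= \mu(a)$. The recursion amplifies rounding errors geometrically, so the sketch is only meant for small $n$.

```python
def bias_differences(h, gamma, lam, mu, n_max):
    """Return [dv(h, 1), ..., dv(h, n_max)] from recursion (22):
    -lam * dv(n + 1) = h * n + min_a {a - mu(a) * dv(n)} - gamma,
    started from dv(h, 0) = 0."""
    S = len(mu) - 1
    dv = [0.0]                               # dv(h, 0) = 0
    for n in range(n_max):
        inner = min(a - mu[a] * dv[n] for a in range(S + 1))
        dv.append((gamma - h * n - inner) / lam)
    return dv[1:]
```

As a check under illustrative values, take a single server with $\lambda = 0.4$, $\mu(1) = 0.6$ and $h = 1$. Serving whenever the queue is nonempty gives an M/M/1 queue with cost rate $\gamma = 2h + 2/3$, and the recursion then returns $\Delta v(h,n) = 5hn + 5/3$.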

Now use $u(h)$ to denote any DSM policy which is optimal for $Q(h)$. We use 
$T[u(h),n]$ for the expected time until the system is first emptied under $u(h)$ 
given that $n(0) = n$. We also use $C[u(h),n]$ for the expected cost incurred 
under $u(h)$ from time 0, when $n(0) = n$, until the system first empties. 

Lemma 7. $\forall h > 0$, 

$$\Delta v(h, n) \ge \{T(u(h),n) - T(u(h),n-1)\}\{hn - \gamma(h)\} \to \infty, \quad n \to \infty.$$

Proof. A standard argument, based on the fact that the system evolving 
under $u(h)$ regenerates upon every entry into the empty state, yields the 
conclusion that 

$$v(h,n) = C(u(h),n) - \gamma(h)T(u(h),n), \quad n \ge 1, \tag{24}$$

from which we immediately infer that 

$$\Delta v(h,n) = \{C(u(h),n) - C(u(h),n-1)\} - \gamma(h)\{T(u(h),n) - T(u(h),n-1)\}, \quad n \ge 1. \tag{25}$$

Consider now the system evolving under $u(h)$ from time 0, when its state 
is $n$, until it enters state $n - 1$ for the first time. The expected time taken is 
plainly $T[u(h), n] - T[u(h),n-1]$ and the holding cost rate incurred through 
this period is bounded below by $hn$. If we write the mean integrated head 
count divided by $T[u(h),n] - T[u(h),n-1]$ as $\chi[u(h),n] \ge n$ and the mean 
total service cost divided by $T[u(h),n] - T[u(h),n-1]$ as $\psi[u(h),n] \ge 1$, we 
infer that 

$$C(u(h),n) - C(u(h),n-1) = \{h\chi(u(h),n) + \psi(u(h),n)\}\{T(u(h),n) - T(u(h),n-1)\} \ge hn\{T(u(h),n) - T(u(h),n-1)\}, \quad n \ge 1. \tag{26}$$

The inequality in the lemma follows immediately from (25) and (26). To 
justify the divergence claim, we simply observe that an assumed permanent 
utilization of the maximum service rate $\mu(S)$ implies that $\{\mu(S) - \lambda\}^{-1}$ is 
a uniform lower bound on $T[u(h),n] - T[u(h),n-1]$, $n \ge 1$. The proof is 
complete. □ 

Before proceeding, we observe from (25) and (26) and the definitions of 
the quantities concerned that we may write 

$$\Delta v(h,n) = [h\{\chi(u(h),n) - \alpha(u(h))\chi(u(h),1)\} + \{\psi(u(h),n) - \alpha(u(h))\psi(u(h),1)\}] \times \{T(u(h),n) - T(u(h),n-1)\}, \quad n \ge 1, \tag{27}$$

where 

$$\alpha(u(h)) = T(u(h),1)[T(u(h),1) + \lambda^{-1}]^{-1}.$$

Note that it is straightforward to establish that 

$$\chi(u(h),n) \ge \chi(u(h),1) > \alpha(u(h))\chi(u(h),1), \quad n \ge 1. \tag{28}$$

The following is an immediate consequence of (23) and Lemma 7. 

Lemma 8. $\forall h > 0$, $\exists N_h < \infty$ such that $u(h, n) = S$, $n \ge N_h$, for all choices 
of $u(h)$. 

We are now in a position to prove full indexability. The key fact to establish 
is that $\Delta v(h,n)$ is increasing in $h$ for each $n \ge 1$. Full indexability will 
then follow trivially from (23). 

Theorem 9 (Full indexability). (i) The function $\Delta v(\cdot,n)$ is increasing 
$\forall n \ge 1$; (ii) the station is fully indexable. 

Proof. Fix $h_0 > 0$. There are two possibilities. Either there exists a 
monotone policy $u(h_0)$ which is uniquely optimal for $Q(h_0)$ (case 1) or not 
(case 2). Under case 1, invoking the preceding lemma we may assert the 
existence of $N_{h_0} < \infty$ such that (23) is satisfied in the form 

$$\Delta v(h_0,n)[\mu(u(h_0,n) + 1) - \mu(u(h_0,n))] < 1 < \Delta v(h_0,n)[\mu(u(h_0,n)) - \mu(u(h_0,n) - 1)], \quad 1 \le n \le N_{h_0} - 1,$$
$$1 < \Delta v(h_0, N_{h_0})[\mu(S) - \mu(S-1)]. \tag{29}$$

Since $\Delta v(\cdot,n)$ is continuous for $n \ge 1$, it must follow that $\exists \epsilon > 0$ with the 
property that the inequalities in (29) are satisfied with $h$ replacing $h_0$ for 
all $h$ in the range $h_0 \le h < h_0 + \epsilon$. We infer from (23) that monotone policy 
$u(h_0)$ is uniquely optimal for $Q(h)$, $h \in (h_0, h_0 + \epsilon)$. If we now consider the 
expression in (27) with $\alpha, \chi, T$ computed with respect to policy $u(h_0)$, it 
follows easily that $\Delta v(h,n)$ is increasing and linear in $h$ over the range 
$h_0 \le h < h_0 + \epsilon$. 

Now consider case 2. Use $\Upsilon(h_0)$ to denote the collection of DSM policies 
which are optimal for $Q(h_0)$. From the preceding lemma and invoking the 
strict concavity of $\mu(\cdot)$, we infer that $\Upsilon(h_0)$ must be finite. Further, the 
continuity of $\Delta v(\cdot,n)$, $n \ge 1$, together with (23) implies the existence of $\delta > 0$ 
such that $Q(h)$ must be optimized by a member of $\Upsilon(h_0)$ for $h$ in the 
range $h_0 \le h < h_0 + \delta$. Suppose that $u \in \Upsilon(h_0)$ optimizes $Q(h)$ for some 
$h \in (h_0, h_0 + \delta)$. It then follows from (27) that 

$$\Delta v(h, n) = [h\{\chi(u, n) - \alpha(u)\chi(u, 1)\} + \{\psi(u, n) - \alpha(u)\psi(u, 1)\}] \times \{T(u,n) - T(u,n-1)\}, \quad n \ge 1, \tag{30}$$

where in (30), $\alpha(u)$, $\chi(u,\cdot)$, $\psi(u,\cdot)$ and $T(u,\cdot)$ denote quantities computed 
with respect to policy $u$. Hence from (30), it follows that for each $n \ge 1$, 
$\Delta v(\cdot,n)$ lies on one of a finite collection of straight lines with positive 
gradient [one for each $u \in \Upsilon(h_0)$] throughout the range $h_0 \le h < h_0 + \delta$. 
However, the continuity of $\Delta v(\cdot, n)$ implies that it must in fact lie on just one 
of those lines throughout that range. It follows that $\Delta v(h,n)$ is increasing 
linear in $h$ over the range $h_0 \le h < h_0 + \delta$. We conclude from the above 
consideration of cases 1 and 2 that, for each $n \ge 1$, $\Delta v(\cdot,n)$ is continuous 
with a positive right gradient at each $h > 0$ and is thus increasing. This 
concludes the proof of part (i). 

For part (ii), we first take the analysis of part (i), case 2, a little further. 
Since for the chosen $\delta > 0$, $\Delta v(h, n)$ is strictly increasing through $[h_0, h_0 + \delta)$ 
for all $n \ge 1$, the only policy which can remain optimal throughout this range 
must satisfy conditions of the form (29). This policy must be maximal (i.e., 
must assign maximal service levels) among those policies in $\Upsilon(h_0)$ and will 
be uniquely optimal for $h \in (h_0, h_0 + \delta)$ and hence monotone. 

From the above discussion, we can infer the following: fix any $h_0 > 0$ and 
choose the maximal optimal policy for $Q(h_0)$. This policy is monotone. Call 
it $u(h_0)$. Define $h_1$ by 

$$h_1 = \inf\{h > h_0;\ u(h_0) \text{ is not optimal for } Q(h)\}.$$

By the above argument $h_1 > h_0$ and $u(h_0)$ is strictly optimal for $Q(h)$, $h \in 
(h_0, h_1)$. Further, if $h_1 < \infty$, $u(h_0)$ is optimal for $Q(h_1)$, but not uniquely 
so. We use $u(h_1)$ for the maximal DSM policy which is optimal for $Q(h_1)$. 
Policy $u(h_1)$ is monotone such that 

$$u(h_1, \cdot) > u(h_0, \cdot), \tag{31}$$

where (31) means 

$$u(h_1,n) \ge u(h_0,n), \quad n \ge 1,$$

with strict inequality for at least one $n$. In this way we can develop a 
sequence $h_0 < h_1 < h_2 < \cdots < h_N < \infty$ and corresponding monotone policies 
$u(h_r)$, $0 \le r \le N$, such that: 

1. $u(h_r)$ is optimal for $Q(h)$, $h \in [h_r, h_{r+1}]$, $0 \le r \le N - 1$; 

2. $u(h_{r+1}, \cdot) > u(h_r, \cdot)$, $0 \le r \le N - 1$; 

3. $u(h_N)$ is optimal for $Q(h)$, $h \in [h_N, \infty)$ and is such that $u(h_N,n) = S$, $n \ge 1$. 

Since the choice of $h_0$ was arbitrary, indexability follows trivially from 1-3. 
This completes the proof of part (ii) and of the theorem. □ 

3.2. Computation of station indices. In the proof of Theorem 9 we 
constructed an ascending set of $h$-values, each of which signaled a change of 
optimal policy for $Q(h)$. In this construction the initial $h_0$ was arbitrary. In 
our discussion of index computation, we shall continue initially to operate in 
$h$-space [i.e., to consider solutions to the optimization problems $Q(h)$], but 
will construct a descending set of $h$-values, labeled $j_1, j_2, \ldots$, each of which 
will also signal a change of optimal policy. We do this because such a set is 
straightforward to initialize, with $j_1$ the supremum of those $h$ for which the 
policy [hereafter labeled $u(j_0)$] which applies the maximal number of servers 
$S$ whenever the queue is nonempty is not optimal for $Q(h)$. Because of our 
ability to restrict to monotone policies, it is clear that both $u(j_0)$ and the 
policy $u(j_1)$ (which applies $S - 1$ servers when the queue length is 1, but 
which otherwise applies $S$ servers) are optimal for $Q(j_1)$. By direct 
calculation of the average cost rates for these policies it is straightforward to verify 
that 



We now give an algorithm for producing the sequence $\{j_m, m \ge 1\}$ and the 
monotone policies $\{u(j_m), m \ge 0\}$ such that $u(j_m)$ is strictly optimal for 
$Q(h)$ in the range $j_{m+1} < h < j_m$. Note that we take $j_0 = \infty$. In the algorithm 
we utilize the characterization of optimal policies for $Q(h)$ given in Lemma 5 
together with the formula for $\Delta v(h, n)$ given following the proof of Lemma 7. 

Algorithm for index computation. 

Step 0. Let $m = 1$. The positive real $j_1$ and the policy $u(j_1)$ are as above. 
The positive integer $N_1$ is given by 

$$N_1 = \min\{n;\ u(j_1,n) = S\} = 2.$$

Step 1. The positive real $j_m$, the policy $u(j_m)$ and the positive integer 
$N_m$ given by 

$$N_m = \min\{n;\ u(j_m,n) = S\}$$

are specified. Determine $(A_n^m, B_n^m;\ 1 \le n \le N_m)$ given by 

$$A_n^m = \{\chi(u(j_m),n) - \alpha(u(j_m))\chi(u(j_m),1)\}\{T(u(j_m),n) - T(u(j_m),n-1)\}$$

and 

$$B_n^m = \{\psi(u(j_m),n) - \alpha(u(j_m))\psi(u(j_m),1)\}\{T(u(j_m),n) - T(u(j_m),n-1)\}.$$

Step 2. Let $j_{m+1}$ be the maximal $h$ satisfying 

$$\{A_n^m h + B_n^m\}\{\mu(u(j_m,n)) - \mu(u(j_m,n) - 1)\} = 1$$

for some $n$ in the range $1 \le n \le N_m$. Let $n_m$ be an $n$-value achieving the 
equality. 

Step 3. Define the policy $u(j_{m+1})$ by 

$$u(j_{m+1},n) = u(j_m,n) - I(n = n_m), \quad n \ge 0,$$

where $I$ is an indicator. Determine $N_{m+1}$ and the $(A_n^{m+1}, B_n^{m+1};\ 1 \le n \le 
N_{m+1})$ as in Step 1. 

Step 4. If $j_{m+1} \le 0$, stop. Otherwise return to Step 2. 
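
Steps 1 and 2 can be sketched for a birth-death head count process as follows. This is an illustration under stated assumptions rather than the authors' implementation. The policy is represented by a hypothetical list `u_head` holding $u(n)$ for $1 \le n \le N$, with $S$ servers used in all higher states. The passage quantities are obtained from first-step recursions, which are an assumption of this sketch. The function names and the rates in the usage note are likewise illustrative.

```python
def step_coefficients(lam, mu, u_head):
    """Coefficients A_n, B_n (n = 1..N) of Step 1 for a monotone policy using
    u_head[n - 1] >= 1 servers in state n <= N and S = len(mu) - 1 servers in
    every state above N. mu is a list with mu[a] = mu(a), mu[S] > lam."""
    S = len(mu) - 1
    N = len(u_head)
    gap = mu[S] - lam
    # Closed forms for the passage N+1 -> N, where S servers are always used:
    tau_next = 1.0 / gap                      # mean passage time
    H_next = (N + 1) / gap + lam / gap ** 2   # mean integrated head count
    G_next = S / gap                          # mean integrated service cost
    tau, H, G = {}, {}, {}
    for n in range(N, 0, -1):                 # first-step recursions, downward
        a = u_head[n - 1]
        if a < 1:
            raise ValueError("nonempty states need at least one server")
        tau[n] = (1.0 + lam * tau_next) / mu[a]
        H[n] = (n + lam * H_next) / mu[a]
        G[n] = (a + lam * G_next) / mu[a]
        tau_next, H_next, G_next = tau[n], H[n], G[n]
    alpha = tau[1] / (tau[1] + 1.0 / lam)
    A = {n: H[n] - alpha * (H[1] / tau[1]) * tau[n] for n in tau}
    B = {n: G[n] - alpha * (G[1] / tau[1]) * tau[n] for n in tau}
    return A, B

def next_threshold(lam, mu, u_head):
    """Step 2: the largest h solving (A_n h + B_n)(mu(u(n)) - mu(u(n) - 1)) = 1
    over 1 <= n <= N, together with a state n achieving it."""
    A, B = step_coefficients(lam, mu, u_head)
    best_h, best_n = None, None
    for n, a in enumerate(u_head, start=1):
        h = (1.0 / (mu[a] - mu[a - 1]) - B[n]) / A[n]
        if best_h is None or h > best_h:
            best_h, best_n = h, n
    return best_h, best_n
```

For the illustrative rates $\lambda = 0.4$ and `mu = [0.0, 0.4, 0.6]`, starting from the policy which always uses both servers (`u_head = [2]`) the sketch returns the threshold $1/3$ at state 1. Under the assumptions of the sketch this agrees with the value $\{\mu(S) - \lambda\}[\{\mu(S) - \mu(S-1)\}^{-1} - S\{\mu(S)\}^{-1}]$ obtained by hand. Applying it again to `u_head = [1, 2]` gives the next threshold $1/7$ at state 2.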

It is now straightforward to recover the station indices (as given in 
Definition 2) from the quantities calculated by the above algorithm. Note, as 
previously, that optimal policies for $P(W)$ and $Q(h/W)$ coincide. In order 
to compute the station index $W(a,n)$, determine from the above algorithm 
the value $j_m$ satisfying 

$$u(j_m, n) = a + 1 \quad \text{and} \quad u(j_{m+1}, n) = a.$$

We then infer that 

$$W(a,n) = \frac{h}{j_{m+1}}.$$

3.3. Numerical study. Extensive numerical investigations have been con- 
ducted on the performance of the greedy index heuristic as a policy for the 
queueing control problems described above. We shall now present some of 
our results as Examples 1 and 2. 

Table 1 

Choices of the parameters $\lambda_1, \lambda_2, \mu_1$ and $\mu_2$ (G, J) and $\nu_1, \nu_2$ (7) and $\eta_1, \eta_2$ (14) which 
give challenging problem sets for Examples 1 and 2 

G                             J                             7                           14
$\lambda_1 \in [0.8,1.1)$     $\lambda_1 \in [0.8,1.1)$     $\nu_1 \in [5.0,10.0)$      $\eta_1 \in [0.07,0.125)$
$\lambda_2 \in [1.6,2.2)$     $\lambda_2 \in [1.6,2.2)$     $\nu_2 \in [0.5,2.0)$       $\eta_2 \in [0.2,0.3)$
$\mu_1 \in [1.5,1.8)$         $\mu_1 \in [1.5,1.8)$
$\mu_2 \in [3.0,3.6)$         $\mu_2 \in [4.4,5.0)$


Example 1. All Example 1 problems concern the dynamic allocation 
of a pool of twenty-five servers ($S = 25$) to two service stations ($K = 2$). 
Service rate functions have the form 

$$\mu_k(a) = a(a + \nu_k)^{-1}\mu_k, \quad k = 1,2. \tag{32}$$

In all, 4950 problems were generated at random, consisting of 99 sets of 50 
problems. For each problem the parameters $\lambda_1, \lambda_2, \mu_1, \mu_2, \nu_1, \nu_2$ were chosen 
by sampling independently from uniform distributions. Full details may be 
found at http://www.lums.lancs.ac.uk/files/onlinesup.pdf. 

For each of the 4950 problems generated, indices were developed using 
the algorithm given in Section 3.2. Time average holding cost rates for the 
greedy index heuristic and an optimal policy were computed using DP value 
iteration and the percentage cost rate excess of the index heuristic over the 
optimum was recorded. Order statistics (minimum, lower quartile, median, 
upper quartile, maximum) of the percentage cost rate excess over optimum 
of the index heuristic are given in Table 2 for the 4950 problems overall, 
together with those for two of the problem sets (G7, J7) for which the heuristic 
performed relatively less well. For ease of reference, Table 1 gives details of 
the uniform distributions used to generate these challenging problem sets. 

Table 2 

The percentage cost rate excess over optimum of (i) the greedy index heuristic 
for all 4950 Example 1 problems, (ii) for problem sets G7 and J7 and 
(iii) for the best static allocation policy 

        Overall     G7         J7         Static
MIN     0.0000      0.0416     0.0263      1.7837
LQ      0.0001      0.0745     0.0558      5.6978
MED     0.0021      0.0964     0.1021      8.1880
UQ      0.0186      0.1670     0.1433     10.9678
MAX     0.2910      0.2910     0.2422     22.1868
N       4950        50         50         4950

Additionally, in Table 2 under the head "Static" are recorded the order 
statistics for the percentage cost rate excess over optimum for the best static 
policy which makes a fixed allocation of servers to stations for all time. These 
latter values give an indication of the potential value of designing a dynamic 
policy for these resource allocation problems. 

The greedy index heuristic performs outstandingly well with a worst case 
suboptimality of 0.2910% for one of the problems generated as part of the 
problem set G7. Inspection of the results for G7 and J7 shows that the 
performance of the index policy is excellent even in problems for which 
the stochastic dynamics of the two stations are very different. Perusal of 
the results for individual problems suggests that the benefits of designing a 
dynamic policy tend to be greatest when the greedy index heuristic performs 
relatively less well. For one particular problem not recorded in Table 2, for 
which the greedy index heuristic had a cost rate excess over optimal of 
0.8801%, that of the best static policy was 48.9693%. 

Example 2. All Example 2 problems concern the dynamic allocation 
of a pool of twenty-five servers ($S = 25$) to two service stations ($K = 2$). 
Service rate functions have the form 

$$\mu_k(a) = (1 - \exp(-a\eta_k))\mu_k, \quad k = 1,2.$$

Other details are similar to those of Example 1. Again, 4950 problems were 
generated at random in 99 sets of 50. For each problem the parameters 
$\lambda_1, \lambda_2, \mu_1, \mu_2, \eta_1, \eta_2$ were chosen by sampling independently from uniform 
distributions. While Table 1 gives details of the distributions used for some 
of the more challenging problems (G14, J14), full details may be found at 
http://www.lums.lancs.ac.uk/files/onlinesup.pdf. 
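
The two service rate families of Examples 1 and 2 can be written down directly and checked against the assumption that the service rate is increasing and concave in team size. The sketch below is illustrative only: the function names are hypothetical and the parameter values in the usage note are simply picked from within the ranges of Table 1.

```python
import math

def mu_example1(a, nu, mu_scale):
    """Service rate of the form (32): a * (a + nu)^(-1) * mu_scale."""
    return a / (a + nu) * mu_scale

def mu_example2(a, eta, mu_scale):
    """Service rate of the Example 2 form: (1 - exp(-a * eta)) * mu_scale."""
    return (1.0 - math.exp(-a * eta)) * mu_scale

def is_increasing_concave(values):
    """True if successive increments are positive and nonincreasing."""
    increments = [b - a for a, b in zip(values, values[1:])]
    increasing = all(d > 0 for d in increments)
    concave = all(d2 <= d1 for d1, d2 in zip(increments, increments[1:]))
    return increasing and concave
```

For example, with $S = 25$, `[mu_example1(a, 5.0, 1.5) for a in range(26)]` and `[mu_example2(a, 0.25, 3.0) for a in range(26)]` both pass the check.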

For each of the 4950 problems generated, the percentage cost rate excess 
of the greedy index heuristic over the optimum was computed. The overall 
results are presented in Table 3 along with those for problem sets G14 and 
J14 and for the best static policy. Similar comments apply as for Example 1. 

Table 3 

The percentage cost rate excess over optimum of (i) the greedy index heuristic 
for all 4950 Example 2 problems, (ii) for problem sets G14 and J14 and 
(iii) for the best static allocation policy 

        Overall     G14        J14        Static
MIN     0.0000      0.0803     0.0279      2.2079
LQ      0.0024      0.1473     0.1100      7.0473
MED     0.0087      0.2164     0.1495     10.2092
UQ      0.0372      0.4289     0.2509     14.4034
MAX     0.8469      0.8469     0.5905     26.5599
N       4950        50         50         4950

4. Spinning plates: Optimal investment in a collection of reward generating 
assets. As a further illustration of the applicability of the methodology 
of Section 2, we now give a brief account of a setup in which a collection 
of $K$ reward generating assets is maintained using a divisible investment 
resource. Each asset $k$ evolves on its (finite) state space $\{0, 1, \ldots, A_k\}$ with 
higher-valued states being those in which the reward performance of the 
asset is stronger. In the absence of investment, assets tend to deteriorate 
toward lower-valued states. Positive investment both arrests the asset's 
tendency to deteriorate and enhances asset performance by enabling upward 
movement of the asset state. Investment decisions will often need to strike 
a balance between maintaining the performance of highly performing assets 
and improving the performance of poorly performing ones. Our model class 
represents a significant generalization of the spinning plates model of asset 
management discussed by Glazebrook, Kirkbride and Ruiz-Hernandez [11] 
to the case of a divisible resource. 

Formally, the system state at time $t$ is $n(t) = \{n_1(t), n_2(t), \ldots, n_K(t)\}$, 
where $n_k(t)$ is the state of asset $k$ at $t$. The system state is observed 
continuously with decision epochs at time zero and at subsequent times at which the 
system state changes. An admissible action is a vector $a = (a_1, a_2, \ldots, a_K)$, 
with $a_k$ identified with the rate at which investment resource is applied to 
asset $k$, where $a_k \in \mathbb{N}$, $1 \le k \le K$, and $\sum_k a_k = R$. Note that $R$ is the rate at 
which investment resource is available to the system. 

Functions $\lambda_k: \{0, 1, \ldots, R\} \times \{0, 1, \ldots, A_k - 1\} \to \mathbb{R}^+$ and $\mu_k: \{0, 1, \ldots, R\} 
\times \{1, 2, \ldots, A_k\} \to \mathbb{R}^+$ are used in the specification of the transition law of 
asset $k$ as follows: 

$$q_k(n_k + 1 \mid n_k, a_k) = \lambda_k(a_k, n_k) I(n_k < A_k) \tag{33}$$

(Investment enhances asset performance) 

and 

$$q_k(n_k - 1 \mid n_k, a_k) = \mu_k(a_k, n_k) I(n_k > 0) \tag{34}$$

(Investment arrests asset deterioration). 

All other transition rates for asset $k$ are zero. We shall assume that $\lambda_k(\cdot, n_k)$ 
is strictly increasing and strictly concave $\forall n_k \in \{0, 1, \ldots, A_k - 1\}$ and that 
$\mu_k(\cdot, n_k)$ is strictly decreasing and strictly convex $\forall n_k \in \{1, 2, \ldots, A_k\}$. These 
conditions describe laws of diminishing returns as the level of investment to 
an asset increases, regardless of its state. It would be natural in many 
application contexts to further assume that each $\lambda_k(a_k, \cdot)$ is decreasing and each 
$\mu_k(a_k, \cdot)$ is increasing $\forall a_k \in \{0, 1, \ldots, R\}$, namely, that when an asset is in 
a higher-valued state, improvements take longer to achieve but asset 
deterioration occurs more rapidly. Our theoretical results do not require these 
latter conditions to hold, though they will feature in the problems analyzed 
in our numerical study. Finally, in state $n$, each asset $k$ earns returns at rate 
$d_k(n_k)$, where $d_k: \{0, 1, \ldots, A_k\} \to \mathbb{R}^+$ is increasing. The dynamic resource 
allocation problem of interest is expressed as 

$$\sup_{u \in U} \sum_{k} D_k(u), \tag{35}$$

where in (35), $U$ is the set of DSM and admissible policies and $D_k(u)$ is the 
reward rate earned by asset $k$ under policy $u$. 

4.1. Assets are fully indexable. Following a version of the development 
of Section 2 which focuses on reward maximization instead of cost 
minimization, we develop a Lagrangian relaxation of (35) which yields $K$ single asset 
problems $P(k, W)$, $1 \le k \le K$, of the form 

$$\sup_{u_k}\{D_k(u_k) - WR_k(u_k)\}. \tag{36}$$

In (36), the supremum is over the class of DSM policies $u_k: \{0, 1, \ldots, A_k\} \to 
\{0, 1, \ldots, R\}$ which can apply any resource level at asset $k$. Further, $D_k(u_k)$ 
is the asset $k$ return rate under policy $u_k$, while $R_k(u_k)$ is the rate of resource 
consumed. Full indexability of project $k$ requires the existence of optimal 
policies for (36) which, in every state, apply a resource rate to the asset 
which is decreasing in the resource charge $W$. In discussing full indexability, 
we now drop the asset subscript $k$ and use $P(W)$ for the single asset problem 

$$\sup_{u}\{D(u) - WR(u)\}. \tag{37}$$

Following the approach of Section 3.1 we introduce the problem $Q(h)$, 
defined by 

$$\sup_{u}\{hD(u) - R(u)\} \tag{38}$$

and argue that full indexability will be established by the existence of 
optimal policies for (38) which, in every state, choose resource levels which are 
increasing in the reward multiplier $h$. 

In order to develop the DP optimality equations for $Q(h)$ we uniformize 
asset evolution by rescaling time such that 

$$\max_{0 \le n \le A}\{\lambda(R, n) + \mu(0, n)\} = 1. \tag{39}$$

Under the rescaling in (39), we use $\gamma(h)$ for the maximal reward rate for $Q(h)$ 
and $v(h, \cdot)$ for the corresponding bias function. The optimality equations 
may be written 

$$0 = -\gamma(h) + hd(n) + \max_{a}[-a + \lambda(a, n)\Delta v(h, n+1)I(n < A) - \mu(a, n)\Delta v(h, n)I(n > 0)], \quad 0 \le n \le A. \tag{40}$$

In (40), we take $\Delta v(h,n) = v(h,n) - v(h,n-1)$, $1 \le n \le A$, and the 
maximization is over $0 \le a \le R$. Lemma 10 uses (40) to give a characterization 
of optimal policies for $Q(h)$. 

Lemma 10. A DSM policy $u$ is optimal for $Q(h)$ if and only if 

$$\Delta v(h,n+1)I(n < A)[\lambda(u(n)+1,n) - \lambda(u(n),n)] + \Delta v(h,n)I(n > 0)[\mu(u(n),n) - \mu(u(n)+1,n)]$$
$$\le 1 \le \Delta v(h,n+1)I(n < A)[\lambda(u(n),n) - \lambda(u(n)-1,n)] + \Delta v(h,n)I(n > 0)[\mu(u(n)-1,n) - \mu(u(n),n)], \quad 0 \le n \le A, \tag{41}$$

where in (41) we take $\lambda(R+1, \cdot) = \lambda(R, \cdot)$, $\lambda(-1, \cdot) = -\infty$, $\mu(R+1, \cdot) = 
\mu(R, \cdot)$, $\mu(-1, \cdot) = \infty$. 

Remark. One important point of difference between our generalized 
spinning plates model and the queueing models of Section 3 is that the 
existence of optimal policies for $Q(h)$ which are monotone in state is no longer 
guaranteed, even for assets for which the transition rates are state 
monotone for any fixed resource level. Indeed, counter-examples are easy to find. 
The following asset appeared in the very first of 2000 randomly generated 
problems contributing to Table 5, which appears later in Section 4.2 as part 
of an extensive numerical investigation into the performance of the greedy 
index heuristic. 

We make the following asset choices: $R = 5$, $A = 10$, 

$$\lambda(a,n) = a(a + \phi)^{-1}, \quad 0 \le a \le 5,\ 0 \le n \le 9,$$

and 

$$\mu(a,n) = \phi(a + \phi)^{-1}\eta, \quad 0 \le a \le 5,\ 1 \le n \le 10,$$

where $\phi = 1.30738$ and $\eta = 1.16393$. Further, the return for the asset is given 
by $d(n) = n(n + 1)$. In Table 4, find values of $u(h,n)$, $0 \le n \le 10$, for seven 
distinct values of $h$, where $u(h, \cdot)$ is an optimal policy for $Q(h)$. 

Table 4 

Values of optimal actions (resource levels) for $Q(h)$ for seven $h$-values and 
all states 0 (leftmost entry) to 10 (rightmost entry) 

3  4  4  4  3  3  2  2  2  1  0     h = 7.37491
2  4  4  4  3  3  2  2  2  1  0     h = 7.07632
2  4  4  3  3  3  2  2  2  1  0     h = 5.32243
2  4  3  3  3  3  2  2  2  1  0     h = 5.21572
2  3  3  3  3  3  2  2  2  1  0     h = 4.98366
2  3  3  3  3  2  2  2  2  1  0     h = 3.84063
1  3  3  3  3  2  2  2  2  1  0     h = 3.48775

Note that for 
the six open $h$-intervals whose endpoints are the successive $h$ values given 
in Table 4, the policy which sits alongside the value of $h$ which is the lower 
endpoint is uniquely optimal throughout the interval. At no value of $h$ in 
the range $(3.48775, 7.37491)$ is there an optimal policy for $Q(h)$ which is 
monotone in state. Please note that the values in Table 4 are consistent 
with the asset's full indexability in that optimal actions for any given state 
are everywhere increasing in $h$ over the range considered. 
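
Optimal policies for a single-asset problem $Q(h)$ of this kind can be approximated by relative value iteration on the uniformized chain. The sketch below is one standard way to do this and is not taken from the paper: the function name, tolerance and iteration cap are assumptions. It is not claimed to reproduce Table 4, whose $h$-values depend on the time scaling adopted.

```python
def solve_asset(h, lam, mu, d, R, A, tol=1e-8, max_iter=20000):
    """Relative value iteration for Q(h): maximize h * D(u) - R(u).
    lam(a, n) and mu(a, n) are the upward and downward rates, d(n) the
    return rate. Assumes lam increasing and mu decreasing in a."""
    def up_rate(a, n):
        return lam(a, n) if n < A else 0.0
    def down_rate(a, n):
        return mu(a, n) if n > 0 else 0.0
    # Uniformization constant: the largest total rate out of any state.
    big = max(up_rate(R, n) + down_rate(0, n) for n in range(A + 1))
    v = [0.0] * (A + 1)
    gain, policy = 0.0, [0] * (A + 1)
    for _ in range(max_iter):
        w = []
        for n in range(A + 1):
            best = None
            for a in range(R + 1):
                p_up = up_rate(a, n) / big
                p_down = down_rate(a, n) / big
                val = ((h * d(n) - a) / big
                       + p_up * v[min(n + 1, A)]
                       + p_down * v[max(n - 1, 0)]
                       + (1.0 - p_up - p_down) * v[n])
                if best is None or val > best:
                    best, policy[n] = val, a
            w.append(best)
        steps = [wi - vi for wi, vi in zip(w, v)]
        gain = 0.5 * (max(steps) + min(steps)) * big  # reward per unit time
        v = [wi - w[0] for wi in w]                   # keep v(0) = 0
        if max(steps) - min(steps) < tol:
            break
    return gain, policy
```

With `lam = lambda a, n: a / (a + phi)`, `mu = lambda a, n: phi * eta / (a + phi)` and `d = lambda n: n * (n + 1)` for the values of $\phi$ and $\eta$ above, a call such as `solve_asset(5.0, lam, mu, d, 5, 10)` returns an estimate of the maximal reward rate together with a greedy policy. At $h = 0$ the sketch returns reward rate 0 with no investment in any state.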

We now consider the state process $\{n(t), t \ge 0\}$ of a single asset evolving 
under some fixed DSM policy $u$ for $Q(h)$. We shall write $\gamma(u, h)$ for the 
reward rate earned under policy $u$ and $v(u,h,\cdot)$ for the corresponding bias 
function. Recall our earlier notational choices: if $u(h)$ is optimal for $Q(h)$ 
then $\gamma(u(h),h) = \gamma(h)$ and $v(u(h), h, \cdot) = v(h, \cdot)$. 

Suppose now that $n(0) = n \in [1, A]$. We define the stopping times $\tau(u, m \mid n)$ 
by 

$$\tau(u, m \mid n) = \inf\{t > 0;\ n(t) = m\}, \quad 0 \le m < n \le A,$$

to be the first time after time 0 at which the asset state enters $m$ when 
policy $u$ is applied throughout. We use 

$$D(u,h,n) = hE\left[\int_0^{\tau(u,0 \mid n)} d\{n(t)\}\,dt\right] - E\left[\int_0^{\tau(u,0 \mid n)} u\{n(t)\}\,dt\right] \tag{42}$$
$$= h\chi(u,n) - \psi(u,n), \quad 1 \le n \le A, \tag{43}$$

for the expected reward (net of resource charges) earned by the asset evolving 
under policy $u$ during $[0, \tau(u, 0 \mid n))$ and 

$$T(u,n) = E\{\tau(u, 0 \mid n)\}, \quad 1 \le n \le A. \tag{44}$$

As in the proof of Lemma 7 we can use standard renewal arguments to infer 
that 

$$v(u,h,n) = D(u,h,n) - \gamma(u,h)T(u,n), \quad 1 \le n \le A, \tag{45}$$

and hence that 

$$\Delta v(u, h, n) = \{D(u, h, n) - D(u, h, n-1)\} - \gamma(u,h)\{T(u,n) - T(u,n-1)\}, \quad 1 \le n \le A. \tag{46}$$

We now observe that taking $n = 1$ in (42)-(44) yields 

$$\gamma(u, h) = [h\chi(u, 1) - \psi(u, 1) + \{hd(0) - u(0)\}\{\lambda(u(0),0)\}^{-1}] \times [T(u,1) + \{\lambda(u(0),0)\}^{-1}]^{-1}. \tag{47}$$
(47) 
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Using (47) in (46) we observe that, for any fixed u, n where I < n < A, Av(u, 
h, n) is affine in h with /i-gradient proportional to 



X (u, n) - xju, n - 1) x (u, 1) + rf(0){A(n(0), O)}' 1 
T(u,n) -T(u,n-1) T(u, 1) + {A(u(0), 0)}- 1 

_ E[£ (u ' n - lln) d{n(t)}dt] E[£ {um d{n(t)}dt] + d(0){A(u(0),0)}- 



which is easily seen to be positive since the return rate d(-) is increasing 
in the state. We infer that Av (u, -,n) is increasing for any fixed u,n where 
1 < n < A. It must, therefore, follow that Av(-,n) is increasing over any 
/i-interval for which there exists some fixed policy u(h) which is strictly 
optimal for Q(h). 

We can now deploy arguments along the lines of those in the proof of 
Theorem 9 to infer Theorem 11 (i) . Please note that Theorem 1 1 (ii) follows 
straightforward from Theorem 11 (i) together with Lemma 10 and the con- 
ditions on the transition rates enunciated after (34). This result generalizes 
Theorem 1 of Glazebrook, Kirkbride and Ruiz-Hernandez [11] to the divisi- 
ble resource case. 

Theorem 11 (Full indexability). (i) The functions Av(-,n) are increas- 
ing Vn, 1 < n < A; (ii) the asset is fully indexable. 

We apply an algorithm similar to that in Section 3.2 to infer the sequence 
of optimal policies as h decreases from some large value for which the optimal 
policy uses maximal resource R in every state below A. Indices are now not 
in general monotone in state. 

4.2. Numerical study. We proceed to assess the quality of the greedy 
index heuristic through a study of 14,000 randomly generated two asset 
problems (K = 2) in which resource is available to the assets in integer 
amounts up to a maximum of 5 or 10 (R = 5 or 10). All assets studied 
evolve over the state space {0, 1, ... , 10} while the transition rates for each 
asset k are assumed to be multiplicatively separable such that 



(48) $\lambda_k(a_k,n_k) = a_k(a_k+\phi_k)^{-1}\xi_k(n_k), \quad 0 \le a_k \le R,\ 0 \le n_k \le 9,$

and

(49) $\mu_k(a_k,n_k) = \phi_k(a_k+\phi_k)^{-1}\eta_k(n_k), \quad 0 \le a_k \le R,\ 1 \le n_k \le 10,$

with $\phi_k$ a positive constant. In all 14,000 problems the $\phi_k$ are obtained
by sampling from the uniform distribution on [0.75, 5.00]. The assets are
assumed always to have a common return function, denoted
$d : \{0, 1, \ldots, 10\} \to \mathbb{R}^+$, which is increasing.
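
As a minimal sketch of the multiplicatively separable rates in (48) and (49), the following Python fragment may help fix ideas. The function names, the fixed seed and the constant choices of the state factors are ours and purely illustrative; we also assume that the resource dependence in (48) takes the form $a_k(a_k+\phi_k)^{-1}$, by analogy with (49).

```python
import random


def up_rate(a, n, phi, xi):
    """Upward rate a / (a + phi) * xi(n), our reading of (48)."""
    return a / (a + phi) * xi(n)


def down_rate(a, n, phi, eta):
    """Downward rate phi / (a + phi) * eta(n), our reading of (49)."""
    return phi / (a + phi) * eta(n)


def sample_phi(rng):
    """Draw phi_k from the uniform distribution on [0.75, 5.00]."""
    return rng.uniform(0.75, 5.00)


rng = random.Random(0)  # fixed seed, only so that the sketch is reproducible
phi = sample_phi(rng)


def xi(n):
    return 1.0  # hypothetical state-independent upward factor


def eta(n):
    return 1.0  # hypothetical state-independent downward factor


# Diminishing returns: each extra unit of resource adds less to the upward rate.
gains = [up_rate(a + 1, 0, phi, xi) - up_rate(a, 0, phi, xi) for a in range(5)]
```

Under this form the upward rate is increasing and concave in the resource level, while the downward rate is decreasing in it.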



In all problems we compare the performance of three heuristic policies for
resource allocation. These are the greedy index policy (Index), the optimal
static policy (Static) and a myopic policy (Myopic) which in every system
state $n = (n_1, n_2)$ chooses an action $a = (a_1, a_2)$ to maximize the rate at
which the return rate from the assets increases, namely,

$\max_a \sum_{k=1}^{2} [\lambda_k(a_k,n_k) I(n_k < 10)\{d(n_k+1) - d(n_k)\} + \mu_k(a_k,n_k) I(n_k > 0)\{d(n_k-1) - d(n_k)\}].$
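
The myopic rule can be sketched by direct enumeration. The code below is illustrative only: the function name `myopic_action` and the callables `lam`, `mu` and `d` are hypothetical, and we assume that admissible actions are the integer pairs with $a_1 + a_2 \le R$.

```python
def myopic_action(n, R, lam, mu, d, top=10):
    """Return the integer allocation (a1, a2) with a1 + a2 <= R (assumed
    constraint) maximizing the instantaneous rate at which the total
    return rate increases in system state n = (n1, n2)."""

    def drift(k, a):
        rate = 0.0
        if n[k] < top:  # an upward transition is possible
            rate += lam(k, a, n[k]) * (d(n[k] + 1) - d(n[k]))
        if n[k] > 0:    # a downward transition is possible
            rate += mu(k, a, n[k]) * (d(n[k] - 1) - d(n[k]))
        return rate

    candidates = [(a1, a2) for a1 in range(R + 1) for a2 in range(R + 1 - a1)]
    return max(candidates, key=lambda a: drift(0, a[0]) + drift(1, a[1]))


# Hypothetical instance with state independent rates and d(n) = n / (n + 1).
phi = (1.0, 1.0)


def lam(k, a, n):
    return a / (a + phi[k])


def mu(k, a, n):
    return phi[k] / (a + phi[k])


def d(n):
    return n / (n + 1)


action = myopic_action((0, 5), 5, lam, mu, d)
```

Ties are broken in favour of the first maximizer encountered, which is an arbitrary choice of the sketch.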

For each problem instance, the return rate achieved under each heuristic is 
compared to optimum and reported as a percentage suboptimality. All com- 
putations utilize DP value iteration. The problems are generated in seven 
groups with 2000 problems in each group. For each group of problems and 
each heuristic the 2000 percentage suboptimalities are summarized using 
order statistics, as was done in Section 3.3. The results are presented in 
Tables 5-8. The problem details now follow. 
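
For readers wishing to reproduce the optimal return rates, one possible form of the value iteration is sketched below. It is not taken from the study itself: we assume an average return criterion with return rate $d(n_1) + d(n_2)$, integer allocations with $a_1 + a_2 \le R$, and uniformization of the continuous time model. The name `optimal_return_rate` and its arguments are hypothetical.

```python
import itertools


def optimal_return_rate(N, R, lam, mu, d, tol=1e-9, max_iter=100_000):
    """Relative value iteration for two assets on states {0, ..., N}^2.
    lam(k, a, n) and mu(k, a, n) are the up/down rates of asset k."""
    states = list(itertools.product(range(N + 1), repeat=2))
    actions = [(a1, a2) for a1 in range(R + 1) for a2 in range(R + 1 - a1)]

    def moves(n, a):
        """(rate, next_state) pairs out of state n under action a."""
        out = []
        for k in range(2):
            if n[k] < N:
                up = list(n)
                up[k] += 1
                out.append((lam(k, a[k], n[k]), tuple(up)))
            if n[k] > 0:
                down = list(n)
                down[k] -= 1
                out.append((mu(k, a[k], n[k]), tuple(down)))
        return out

    # Uniformization constant, strictly above every total exit rate.
    B = 1.0 + max(sum(r for r, _ in moves(n, a))
                  for n in states for a in actions)
    V = dict.fromkeys(states, 0.0)
    lo = hi = 0.0
    for _ in range(max_iter):
        TV = {}
        for n in states:
            reward = d(n[0]) + d(n[1])
            TV[n] = max(
                (reward + sum(r * (V[m] - V[n]) for r, m in moves(n, a))) / B
                + V[n]
                for a in actions)
        diffs = [TV[n] - V[n] for n in states]
        lo, hi = B * min(diffs), B * max(diffs)  # bounds on the optimal rate
        V = {n: TV[n] - TV[states[0]] for n in states}
        if hi - lo < tol:
            break
    return 0.5 * (lo + hi)


# Small hypothetical instance (N = 3, so that the sketch runs quickly).
phi = (1.0, 1.0)


def lam(k, a, n):
    return a / (a + phi[k])


def mu(k, a, n):
    return phi[k] / (a + phi[k])


rate_R1 = optimal_return_rate(3, 1, lam, mu, lambda n: n / (n + 1))
rate_R2 = optimal_return_rate(3, 2, lam, mu, lambda n: n / (n + 1))
```

The return rate of a heuristic can be obtained from the same recursion by replacing the maximization with the action the heuristic prescribes in each state.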

The results in Table 5 concern a very simple model in which the transition
rates are state independent. We take $\xi_k(\cdot) = 1$, k = 1, 2, while the
$\eta_k(\cdot)$ also are constant, with values obtained by sampling from the
uniform distribution on [0.75, 1.25]. Resource is available to the assets at
total rate R = 5 throughout. In all cases, asset return rates are increasing
concave in the asset state and given by

$d(n) = n(n+1)^{-1}, \quad 0 \le n \le 10.$

These asset management problems prove challenging and the myopic pro- 
posal performs poorly in Table 5, being consistently outperformed by both 
Index and Static. Over the 2000 problems sampled, the percentage subop- 
timality of Index is roughly uniformly distributed on the interval [0.0, 1.9], 
while that for Static is also roughly uniform, but across the considerably 
wider range [0.0,13.7]. 
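
The summaries reported in the tables can be computed along the following lines. This is a sketch under stated assumptions: the function names are ours, and since the quartile convention is not specified in the text we use the inclusive method of Python's `statistics.quantiles`.

```python
import statistics


def percent_suboptimality(heuristic_rate, optimal_rate):
    """Percentage by which a heuristic's return rate falls below optimum."""
    return 100.0 * (optimal_rate - heuristic_rate) / optimal_rate


def summarize(values):
    """Order statistics in the layout of the tables (MIN, LQ, MED, UQ, MAX, N)."""
    lq, med, uq = statistics.quantiles(values, n=4, method="inclusive")
    return {"MIN": min(values), "LQ": lq, "MED": med,
            "UQ": uq, "MAX": max(values), "N": len(values)}


# Hypothetical data, for illustration only (not results from the study).
summary = summarize([0.0, 1.0, 2.0, 3.0, 4.0])
```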

Table 5

The percentage return rate below optimum of (i) the
greedy index heuristic, (ii) the best static allocation
policy and (iii) a myopic policy for 2000 problems with
state independent transition rates. See text for details

          Index    Static    Myopic
MIN      0.0000    0.0719    0.0027
LQ       0.1482    3.7812    4.7774
MED      0.6752    6.1724   16.7270
UQ       1.0751    7.4822   26.5042
MAX      1.9082   13.6966   39.3193
N          2000      2000      2000
Table 6

The percentage return rate below optimum of (i) the greedy
index heuristic, (ii) the best static allocation policy and
(iii) a myopic policy for 2000 problems with state dependent
transition rates. See text for details

          Index    Static    Myopic
MIN      0.0000    0.0305    2.2993
LQ       0.0000    0.0695    6.7075
MED      0.0001    0.1179   13.0721
UQ       0.0008    0.1888   17.9062
MAX      0.9685    1.0340   23.0439
N          2000      2000      2000



For the next group of problems we set R = 10 and introduce state
dependence into the transition rates. In (48) and (49) we take

(50) $\xi_k(n_k) = \{11^{\alpha_k} - (n_k+1)^{\alpha_k}\}(n_k+1)^{-\alpha_k+1}, \quad 0 \le n_k \le 9,$

and

(51) $\eta_k(n_k) = n_k, \quad 1 \le n_k \le 10,$

where in (50) and (51), $\alpha_k > 1$ is a positive constant. The choices in (50), (51)
feature in the numerical study undertaken by Glazebrook, Kirkbride and
Ruiz-Hernandez [11] of their much simpler spinning plates model. The
function in (50) is decreasing and convex over the range $0 \le n_k \le 9$. The
degree of curvature of the function and the value of $\xi_k(0)$ both increase with
the value of $\alpha_k$. For the problems featured in Table 6, we obtain the $\alpha_k$
by sampling from the uniform distribution on [1.05, 1.50]. Here the models
are such that achieving improvements to asset performance is increasingly
difficult for higher states. This effect will be most marked when $\alpha_k$ is close

Table 7

The percentage return rate below optimum of (i) the greedy
index heuristic, (ii) the best static allocation policy and
(iii) a myopic policy for 2000 problems with state independent
transition rates. See text for details

          Index    Static    Myopic
MIN      0.0000    0.1830    1.2736
LQ       0.0000    0.3275    1.7252
MED      0.0001    0.3817    1.9311
UQ       0.0012    0.4652    2.5708
MAX      0.0095    0.7310   16.1912
N          2000      2000      2000
Table 8

The percentage return rate below optimum of (i) the greedy index heuristic,
(ii) the best static allocation policy and (iii) a myopic policy for 8000 problems
with state dependent transition rates. See text for details

         (a) $\alpha_k \sim U[1.05, 1.20]$     (b) $\alpha_k \sim U[1.20, 1.35]$
          Index    Static    Myopic     Index    Static    Myopic
MIN      0.0000    0.0187    1.2278    0.0000    0.0987    1.1529
LQ       0.2446    4.7749    2.4854    0.0556    8.3715    2.6063
MED      0.6471   10.9720    4.5413    0.5215   14.7471    4.9759
UQ       2.6301   17.0301    7.0980    2.0182   21.1644    8.8432
MAX     10.8450   28.0785   22.3554    9.5897   31.7000   22.5440
N          2000      2000      2000      2000      2000      2000

         (c) $\alpha_k \sim U[1.35, 1.50]$     (d) $\alpha_k \sim U[1.50, 1.65]$
          Index    Static    Myopic     Index    Static    Myopic
MIN      0.0000    0.3388    1.1130    0.0000    0.9814    0.9718
LQ       0.0122   11.2186    2.8107    0.0034   14.3835    3.6829
MED      0.2554   17.4297    5.9093    0.1743   21.1017    7.6089
UQ       1.7601   24.0923   10.6612    1.6311   27.6231   13.4215
MAX      8.0043   33.8457   22.7322    6.4821   36.3746   24.4466
N          2000      2000      2000      2000      2000      2000


to the top of its range. Finally, our choice of asset return rate is

(52) $d(n) = \begin{cases} 0, & 0 \le n \le 4, \\ (n-4)/5, & 5 \le n \le 8, \\ 1, & n = 9, 10. \end{cases}$

Here state 9 is the minimum for an asset to generate returns at maximal 
rate. Further, should an asset deteriorate to the point that its state is 4 or 
less it is incapable of generating any returns. In contrast to the problems 
featured in Table 5, this return is nonconcave in state. 
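
The contrast between the two return functions can be checked numerically through second differences. The sketch below is illustrative; the function names are ours.

```python
def d_concave(n):
    """Return rate used for Table 5: d(n) = n / (n + 1)."""
    return n / (n + 1)


def d_threshold(n):
    """Return rate (52): zero up to state 4, linear on 5..8, maximal at 9, 10."""
    if n <= 4:
        return 0.0
    if n <= 8:
        return (n - 4) / 5
    return 1.0


def second_differences(d, top=10):
    """d(n+1) - 2 d(n) + d(n-1) for n = 1, ..., top - 1."""
    return [d(n + 1) - 2 * d(n) + d(n - 1) for n in range(1, top)]


def is_increasing(d, top=10):
    return all(d(n + 1) >= d(n) for n in range(top))


def is_concave(d, top=10, eps=1e-12):
    # A small tolerance guards against floating point noise.
    return all(x <= eps for x in second_differences(d, top))
```

Both functions are increasing, but (52) has a strictly positive second difference at n = 4, where returns first become available, so it fails the concavity check.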

The results for this group of 2000 problems are given in Table 6. In
Table 7 we consider a slightly modified set of such problems for which R = 5
and the downward transition rates are given by

$\eta_k(n_k) = 0.5\,n_k, \quad 1 \le n_k \le 10.$

The problems featured in Tables 6 and 7 prove relatively unchallenging
to both Index and Static, in part because of the highly discrepant upward
transition rates obtained from distinct $\alpha_k$. If we tame this feature by
rescaling the functions $\xi_k$ (after $\alpha_k$ has been chosen) such that $\xi_k(0)$ is a
fixed amount (here taken to be 12) then the problems become very much
more difficult and the performance of Static can become quite poor. Table 8
features 8000 such problems. The subtables correspond to distinct ranges
for the sampled $\alpha_k$. In Table 8(a)-8(d) we have $\alpha_k \sim U[1.05, 1.20]$,
$\alpha_k \sim U[1.20, 1.35]$, $\alpha_k \sim U[1.35, 1.50]$ and $\alpha_k \sim U[1.50, 1.65]$, respectively.
Problem details are otherwise as for Table 6. From Table 8, the relatively poor
performance of both Static and Myopic makes it clear that these are difficult
problems for which dynamic policies, which take adequate account of
the future impact of current decisions, really are needed. The greedy index
heuristic delivers a readily understood proposal which continues to perform
robustly even in this very challenging problem environment. It is especially
effective for the problems with larger sampled $\alpha_k$, in which it is most difficult
to maintain assets in strongly performing states.
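
The state factor in (50) and the rescaling used for Table 8 can be sketched as follows. This is illustrative only: the function names are ours, and we assume that the rescaling is a multiplicative one which fixes the value at state 0.

```python
def xi(n, alpha):
    """State factor (50): {11^alpha - (n+1)^alpha} (n+1)^(1 - alpha), 0 <= n <= 9."""
    return (11.0 ** alpha - (n + 1.0) ** alpha) * (n + 1.0) ** (1.0 - alpha)


def xi_rescaled(n, alpha, target=12.0):
    """Assumed multiplicative rescaling so that the value at n = 0 equals target."""
    return target * xi(n, alpha) / xi(0, alpha)


alpha = 1.2  # hypothetical value inside the sampled ranges
values = [xi(n, alpha) for n in range(10)]
first_diffs = [b - a for a, b in zip(values, values[1:])]
second_diffs = [b - a for a, b in zip(first_diffs, first_diffs[1:])]
```

Negative first differences and positive second differences confirm numerically that the function is decreasing and convex over the stated range, while the unscaled value at state 0 grows with the exponent.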

5. Conclusions and proposals for further work. The paper has described 
radical extensions to index theory which facilitate the analysis of dynamic 
resource allocation problems in which a single key resource may be assigned 
more flexibly than is allowed in classical bandit models. The resulting greedy 
index heuristic has been shown to perform strongly for a range of models 
which relate to applications, inter alia, in queueing control and asset man- 
agement which are of independent interest. 

Without doubt, the primary obstacle to general implementation of the
approach described concerns the requirement to establish full indexability.
Full indexability requires that optimal solutions to the single project
problems P(k, W), $1 \le k \le K$, derived from a Lagrangian relaxation of the
original problem, assign diminishing levels of resource uniformly over
project states as the resource charge W increases. While we have been able
to demonstrate this for the models of Sections 3 and 4, it presents a
formidable challenge in many problems. We propose to develop our approach
further by exploring the quality of index heuristics derived from strongly
performing (though possibly not optimal) policies for the P(k, W),
$1 \le k \le K$, which have the above structural property required to create
indices.

Acknowledgment. We gratefully acknowledge the helpful comments of 